Fast Sparse Superposition Codes have 
Exponentially Small Error Probability for R < C 

Antony Joseph, Student Member, IEEE, and Andrew R Barron, Senior Member, IEEE 



Abstract — For the additive white Gaussian noise channel with 
average codeword power constraint, sparse superposition codes 
are developed. These codes are based on the statistical high- 
dimensional regression framework. The paper [IEEE Trans. 
Inform. Theory 55 (2012), 2541 - 2557] investigated decoding 
using the optimal maximum-likelihood decoding scheme. Here 
a fast decoding algorithm, called the adaptive successive decoder, is
developed. For any rate R less than the capacity C, communication
is shown to be reliable with exponentially small error probability.

Index Terms — gaussian channel, multiuser detection, succes- 
sive cancelation decoding, error exponents, achieving channel 
capacity, subset selection, compressed sensing, greedy algorithms, 
orthogonal matching pursuit. 

I. Introduction 

The additive white Gaussian noise channel is basic to Shan- 
non theory and underlies practical communication models. 
Sparse superposition codes for this channel were developed
in [24], where reliability bounds for the optimal maximum-likelihood
decoding were given. The present work provides
comparable bounds for our fast adaptive successive decoder.

In the familiar communication setup, an encoder maps
length $K$ input bit strings $u = (u_1, u_2, \ldots, u_K)$ into
codewords, which are length $n$ strings of real numbers
$c_1, c_2, \ldots, c_n$, with power $(1/n)\sum_{i=1}^{n} c_i^2$. After transmission
through the Gaussian channel, the received string
$Y = (Y_1, Y_2, \ldots, Y_n)$ is modeled by

$$Y_i = c_i + \varepsilon_i \quad \text{for } i = 1, \ldots, n,$$

where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$. The decoder produces an
estimate $\hat{u}$ of the input string $u$, using knowledge of the
received string $Y$ and the codebook. The decoder makes a
block error if $\hat{u} \neq u$. The reliability requirement is that, with
sufficiently large $n$, the block error probability is small
when averaged over input strings $u$ as well as the distribution
of $Y$. The communication rate $R = K/n$ is the ratio of the
number of message bits to the number of uses of the channel
required to communicate them.

The supremum of reliable rates of communication is the
channel capacity $C = (1/2)\log_2(1 + P/\sigma^2)$, by traditional
information theory [30], [16]. Here $P$ expresses a control on
the codeword power. For practical coding the challenge is to
achieve arbitrary rates below the capacity while guaranteeing
reliable decoding in manageable computation time.

A solution to the Gaussian channel coding problem, when
married to appropriate modulation schemes, is regarded as
relevant to myriad settings involving transmission over wires
or cables for internet, television, or telephone communications
or in wireless radio, TV, phone, satellite or other space
communications.

Previous standard approaches, as discussed in [18], entail
a decomposition into separate problems of modulation, of 
shaping of a multivariate signal constellation, and of coding. 
As they point out, though there are practical schemes with 
empirically good performance, theory for practical schemes 
achieving capacity is lacking. In the next subsection we 
describe the framework of our codes. 

A. Sparse Superposition Codes 

The framework here is as introduced in [24], but for clarity 
we describe it again in brief. The story begins with a list 
$X_1, X_2, \ldots, X_N$ of vectors, each with $n$ coordinates, which
can be thought of as organized into a design, or dictionary,
matrix $X$, where

$$X_{n \times N} = [X_1 : X_2 : \ldots : X_N].$$

The entries of $X$ are drawn i.i.d. $N(0, 1)$. The codeword
vectors take the form of particular linear combinations of 
columns of the design matrix. 

More specifically, we assume $N = LM$, with $L$ and $M$
positive integers, and the design matrix $X$ is split into $L$
sections, each of size $M$. The codewords are of the form $X\beta$,
where each $\beta \in \mathbb{R}^N$ belongs to the set

$$B = \{\beta : \beta \text{ has exactly one non-zero in each section, with value in section } \ell \text{ equal to } \sqrt{P_{(\ell)}}\}.$$

This is depicted in figure 1. The values $P_{(\ell)}$, for $\ell = 1, \ldots, L$,
chosen beforehand, are positive and satisfy

$$\sum_{\ell=1}^{L} P_{(\ell)} = P, \tag{1}$$

where recall that $P$ is the power for our code.

The received vector is in accordance with the statistical
linear model $Y = X\beta + \varepsilon$, where $\varepsilon$ is the noise vector
distributed $N(0, \sigma^2 I)$.

Accordingly, with the $P_{(\ell)}$ chosen to satisfy (1), we have
$\|\beta\|^2 = P$ and hence $E\|X\beta\|^2/n = P$, for each $\beta$ in $B$.
Here $\|\cdot\|$ denotes the usual Euclidean norm. Thus the expected
codeword power is controlled to be equal to $P$. Consequently,
most of the codewords have power near $P$ and the average
power across the $M^L$ codewords, given by

$$\frac{1}{M^L} \sum_{\beta \in B} \frac{\|X\beta\|^2}{n},$$

is concentrated at $P$.

Fig. 1. Schematic rendering of the dictionary matrix $X$ and coefficient vector
$\beta$. The vertical bars in the $X$ matrix indicate the selected columns from a
section.

Here, we study both the case of constant power allocation,
where each $P_{(\ell)}$ is equal to $P/L$, and a variable power allocation
where $P_{(\ell)}$ is proportional to $e^{-2C\ell/L}$. These variable
power allocations are used in getting the rate up to capacity.
This is a slight difference from the setup in [24], where the
analysis was for the constant power allocation.
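The two allocations can be made concrete with a small numerical sketch. The function name `power_allocation` and the example values are illustrative choices, not part of the paper, and the capacity in the exponent is assumed here to be measured in nats.

```python
import math

def power_allocation(P, L, C=None):
    """Return [P_(1), ..., P_(L)], normalized to sum to P.

    C=None gives the constant allocation P/L; otherwise P_(l) is taken
    proportional to exp(-2*C*l/L), with C assumed to be in nats.
    """
    if C is None:
        return [P / L] * L
    weights = [math.exp(-2.0 * C * l / L) for l in range(1, L + 1)]
    total = sum(weights)
    return [P * w / total for w in weights]

# Hypothetical example values.
P, sigma2, L = 7.0, 1.0, 8
C_nats = 0.5 * math.log(1.0 + P / sigma2)   # capacity in nats (assumed unit)
flat = power_allocation(P, L)
decaying = power_allocation(P, L, C_nats)
```

Both lists sum to $P$, so the power constraint holds for either choice; under the variable allocation the earlier sections receive more power than the later ones.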

For ease in encoding, it is most convenient that the section
size $M$ is a power of two. Then an input bit string $u$ of length
$K = L \log_2 M$ splits into $L$ substrings of size $\log_2 M$ and
the encoder becomes trivial. Each substring of $u$ gives the
index (or memory address) of the term to be sent from the
corresponding section.
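The encoder described above can be sketched in a few lines. The function name `bits_to_indices` and the 0-based column numbering (column `l*M + offset` for section `l`) are conventions assumed for this illustration only.

```python
def bits_to_indices(bits, L, M):
    """Split a bit string of length L*log2(M) into L column indices.

    Index l*M + offset is the column chosen in section l (0-based),
    where offset is the integer encoded by that section's substring.
    """
    b = M.bit_length() - 1          # log2(M) when M is a power of two
    if M < 2 or M != 1 << b:
        raise ValueError("M must be a power of two, at least 2")
    if len(bits) != L * b:
        raise ValueError("expected L*log2(M) bits")
    indices = []
    for l in range(L):
        chunk = bits[l * b:(l + 1) * b]
        offset = int("".join(str(x) for x in chunk), 2)
        indices.append(l * M + offset)
    return indices

# Six bits, three sections of size four: substrings 10, 01, 11.
sent = bits_to_indices([1, 0, 0, 1, 1, 1], L=3, M=4)
```

Here `sent` holds one column index per section; the codeword would then be the sum of those columns, each scaled by the square root of its section power.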

As we have said, the rate of the code is $R = K/n$ input
bits per channel use and we arrange for arbitrary $R$ less than
$C$. For the partitioned superposition code this rate is

$$R = \frac{L \log M}{n}.$$

For specified $L$, $M$ and $R$, the codelength is $n = (L/R) \log M$.
Thus the block length $n$ and the subset size $L$ agree to within
a log factor.
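As a quick numerical check of these relations, the sketch below computes the capacity and the codelength for one hypothetical choice of parameters. It assumes rates in bits, so that the logarithm in the codelength formula is taken to base 2; the function names are illustrative.

```python
import math

def capacity_bits(snr):
    """Capacity C = (1/2) log2(1 + snr), in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)

def blocklength(L, M, R):
    """Codelength n = (L/R) log2(M) for a rate R in bits per channel use."""
    return L * math.log2(M) / R

snr = 7.0
C = capacity_bits(snr)            # 0.5 * log2(8) = 1.5 bits
L, M, R = 512, 512, 0.75          # hypothetical choice with M = L
n = blocklength(L, M, R)          # 512 * 9 / 0.75 = 6144
N = L * M                         # dictionary size, 262144 columns
```

With $M = L$ the dictionary has $L^2$ columns while $n$ is of order $L \log L$, which is the "within a log factor" relation stated above.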

Control of the dictionary size is critical to computationally
advantageous coding and decoding. At one extreme, $L$ is a
constant, and section size $M = 2^{nR/L}$. However, its size,
which is exponential in $n$, is impractically large. At the other
extreme $L = nR$ and $M = 2$. However, in this case the
number of non-zeroes of $\beta$ proves to be too dense to permit
reliable recovery at rates all the way up to capacity. This can be
inferred from recent converse results on information-theoretic
limits of subset recovery in regression (see, e.g., [37], [?]).

Our codes lie in between these extremes. We allow $L$ to
agree with the blocklength $n$ to within a log factor, with $M$
arranged to be polynomial in $n$ or $L$. For example, we may let
$M = n$, in which case $L = nR/\log n$, or we may set $M = L$,
making $n = (L \log L)/R$. For the decoder we develop here,
at rates below capacity, the error probability is also shown to
be exponentially small in $L$.



Optimal decoding for minimal average probability of error
consists of finding the codeword $X\beta$ with coefficient vector
$\beta \in B$ that maximizes the posterior probability, conditioned
on $X$ and $Y$. This coincides, in the case of equal prior
probabilities, with the maximum likelihood rule of seeking

$$\arg\min_{\beta \in B} \|Y - X\beta\|.$$

Performance bounds for such optimal, though computationally
infeasible, decoding are developed in the companion paper
[24]. Instead, here we develop fast algorithms for which we
can still establish desired reliability and rate properties. We
describe the intuition behind the algorithm in the next subsection.
Section II describes the algorithm in full detail.
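To see why the maximum likelihood rule is computationally infeasible, it may help to write it out as an exhaustive search. The sketch below is only an illustration for toy sizes; the name `ml_decode`, the list-of-columns representation and the 0-based indexing are assumptions made here. The loop visits all $M^L$ codewords, which is the source of the infeasibility.

```python
import itertools
import math

def ml_decode(cols, Y, P_sec, M):
    """Exhaustive search over all M**L codewords for the one closest to Y.

    cols[j] is column X_j as a list of n floats; section l (0-based)
    holds columns l*M, ..., l*M + M - 1 and has power P_sec[l].
    """
    L, n = len(P_sec), len(Y)
    best, best_dist = None, math.inf
    for offsets in itertools.product(range(M), repeat=L):
        choice = [l * M + o for l, o in enumerate(offsets)]
        fit = [sum(math.sqrt(P_sec[l]) * cols[j][i] for l, j in enumerate(choice))
               for i in range(n)]
        dist = sum((y - f) ** 2 for y, f in zip(Y, fit))
        if dist < best_dist:
            best, best_dist = choice, dist
    return best

# Toy noise-free example: L = 2 sections of size M = 2, n = 3.
cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
Y = [0.0, 1.0, 1.0]               # sum of columns 1 and 2
decoded = ml_decode(cols, Y, P_sec=[1.0, 1.0], M=2)
```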

B. Intuition behind the algorithm 

From the received Y and knowledge of the dictionary, 
we decode which terms were sent by an iterative algorithm. 
Denote as

$$\mathrm{sent} = \{j : \beta_j \neq 0\}$$

and

$$\mathrm{other} = \{j : \beta_j = 0\}.$$



The set sent consists of one term from each section, and 
denotes the set of correct terms, while other denotes the set 
of wrong terms. We now give a high-level description of the 
algorithm. 

The first step is as follows. For each term $X_j$ of the
dictionary, compute the normalized inner product with the
received string $Y$, given by

$$Z_{1,j} = \frac{X_j^T Y}{\|Y\|},$$

and see if it exceeds a positive threshold $\tau$.

The idea of the threshold $\tau$ is that very few of the terms
in other will be above threshold. Yet a positive fraction
of the terms in sent will be above threshold, and hence
will be correctly decoded on this first step. Denoting
$J = \{1, 2, \ldots, N\}$, take

$$\mathrm{dec}_1 = \{j \in J : Z_{1,j} > \tau\}$$

as the set of terms detected in the first step.
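A rough calculation indicates why few wrong terms should exceed the threshold on the first step. For $j$ in other, the column $X_j$ is independent of $Y$, so the first-step statistic for such a term should be standard normal. The sketch below assumes a threshold of the form $\sqrt{2 \log M} + a$ with natural logarithm, as adopted later in the text, and counts the expected number of wrong terms per section above it. The function names and the values of `a` are illustrative.

```python
import math

def gaussian_tail(t):
    """P(Z > t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def expected_false_alarms_per_section(M, a):
    """Expected number of the M-1 wrong terms in a section that exceed tau."""
    tau = math.sqrt(2.0 * math.log(M)) + a   # assumed threshold form
    return (M - 1) * gaussian_tail(tau)

for a in (0.0, 0.5, 1.0):
    print(a, expected_false_alarms_per_section(2 ** 16, a))
```

Increasing `a` lowers the expected count further, at the cost of making it harder for the correct terms to pass the threshold.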

Denoting $P_j = P_{(\ell)}$ if $j$ is in section $\ell$, the output of the
first step consists of the set of decoded terms $\mathrm{dec}_1$ and the
vector

$$F_1 = \sum_{j \in \mathrm{dec}_1} \sqrt{P_j}\, X_j,$$

which forms the first part of the fit. The set of terms investigated
in step 1 is $J_1 = J$, the set of all columns of the
dictionary. Then the set $J_2 = J_1 - \mathrm{dec}_1$ remains for second
step consideration. In the extremely unlikely event that the size of $\mathrm{dec}_1$
is already at least $L$, there will be no need for the second step.

For the second step, compute the residual vector

$$R_2 = Y - F_1.$$

For each of the remaining terms, that is terms in $J_2$, compute
the normalized inner product

$$Z^{\mathrm{res}}_{2,j} = \frac{X_j^T R_2}{\|R_2\|},$$

which is compared to the same threshold $\tau$. Then $\mathrm{dec}_2$, the set
of decoded terms for the second step, is chosen in a manner
similar to that in the first step. In other words, we take

$$\mathrm{dec}_2 = \{j \in J_2 : Z^{\mathrm{res}}_{2,j} > \tau\}.$$

From the set $\mathrm{dec}_2$, compute the fit $F_2 = \sum_{j \in \mathrm{dec}_2} \sqrt{P_j}\, X_j$
for the second step.

The third and subsequent steps would proceed in the same
manner as the second step. For any step $k$, we are only interested
in

$$J_k = J - (\mathrm{dec}_1 \cup \mathrm{dec}_2 \cup \ldots \cup \mathrm{dec}_{k-1}),$$

that is, terms not decoded previously. One first computes the
residual vector $R_k = Y - (F_1 + \ldots + F_{k-1})$. Accordingly,
for terms in $J_k$, we take $\mathrm{dec}_k$ as the set of terms for which
$Z^{\mathrm{res}}_{k,j} = X_j^T R_k/\|R_k\|$ is above $\tau$.

We arrange the algorithm to run for at most a pre-specified
number of steps $m$, arranged to be of the order of
$\log M$. The algorithm could stop before $m$ steps if either there
are no terms above threshold, or if $L$ terms have been detected.
Also, if in the course of the algorithm, two terms are detected
in a section, then we declare an error in that section.
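The high-level procedure above can be sketched as a small simulation. This is an illustration of the residual-based version only, not of the modified algorithm analyzed later; the function name, the toy parameters and the threshold offset 0.5 are all assumptions made for the example. The sketch returns the per-step decoded sets and leaves the per-section bookkeeping (two terms in one section) to the caller.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def threshold_decode(cols, Y, P_sec, M, tau, max_steps):
    """Iterative thresholding on residuals, following the description above.

    cols[j] is column X_j (a list of n floats); P_sec[l] is the power of
    section l; column j lies in section j // M.  Returns the list of
    per-step decoded sets dec_1, dec_2, ...
    """
    L = len(P_sec)
    remaining = set(range(len(cols)))       # J_k
    residual = list(Y)                      # R_k, with R_1 = Y
    steps = []
    for _ in range(max_steps):
        norm = math.sqrt(dot(residual, residual))
        dec = {j for j in remaining if dot(cols[j], residual) / norm > tau}
        if not dec:
            break                           # no terms above threshold
        steps.append(dec)
        remaining -= dec
        for j in dec:                       # subtract the fit F_k
            c = math.sqrt(P_sec[j // M])
            residual = [r - c * x for r, x in zip(residual, cols[j])]
        if sum(len(d) for d in steps) >= L:
            break                           # L terms have been detected
    return steps

random.seed(0)
L, M, n, P, sigma = 4, 8, 64, 15.0, 1.0     # hypothetical toy sizes
cols = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(L * M)]
sent = [l * M + random.randrange(M) for l in range(L)]
P_sec = [P / L] * L
Y = [sum(math.sqrt(P_sec[j // M]) * cols[j][i] for j in sent)
     + random.gauss(0.0, sigma) for i in range(n)]
tau = math.sqrt(2.0 * math.log(M)) + 0.5
steps = threshold_decode(cols, Y, P_sec, M, tau,
                         max_steps=2 * int(math.log2(M)))
```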

Ideally, the decoder selects one term from each section, 
producing an output which is the index of the selected term. 
For a particular section, there are three possible ways a mistake 
could occur when the algorithm is completed. The first is an 
error, in which the algorithm selects exactly one wrong term 
in that section. The second case is when two or more terms are 
selected, and the third is when no term is selected. We call the 
second and third cases erasures since we know for sure that in 
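these cases an error has occurred. Let $\hat{\delta}_{\mathrm{mis,error}}$ and $\hat{\delta}_{\mathrm{mis,erasure}}$
denote the fraction of sections with errors and erasures, respectively.
Denoting the section mistake rate,

$$\hat{\delta}_{\mathrm{mis}} = 2\,\hat{\delta}_{\mathrm{mis,error}} + \hat{\delta}_{\mathrm{mis,erasure}}, \tag{2}$$

our analysis provides a good bound, denoted by $\delta_{\mathrm{mis}}$, on $\hat{\delta}_{\mathrm{mis}}$
that is satisfied with high probability.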
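The bookkeeping behind the section mistake rate can be illustrated as follows. The function name and the 0-based indexing (column `j` in section `j // M`) are assumptions for this sketch. The double weight on errors matches the usual accounting for Reed-Solomon codes, where an error costs twice as much correction capability as an erasure.

```python
def section_mistake_rate(decoded, sent, L, M):
    """Return (error_frac, erasure_frac, mistake_rate) over L sections.

    decoded: set of selected column indices; sent: one true index per
    section.  Column j belongs to section j // M (0-based).
    """
    by_section = [[] for _ in range(L)]
    for j in decoded:
        by_section[j // M].append(j)
    errors = erasures = 0
    for l in range(L):
        picks = by_section[l]
        if len(picks) != 1:
            erasures += 1                   # none, or two or more, selected
        elif picks[0] != sent[l]:
            errors += 1                     # exactly one term, but wrong
    err, era = errors / L, erasures / L
    return err, era, 2 * err + era

# Five sections of size four: one error, two erasures, two correct.
rates = section_mistake_rate({2, 5, 9, 10, 17},
                             sent=[2, 4, 11, 13, 17], L=5, M=4)
```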

The algorithm we analyze, although very similar in spirit, 
is a modification of the above algorithm. The modifications 
are made so as to help characterize the distributions of $Z^{\mathrm{res}}_{k,j}$,
for $k \geq 2$. These are described in section II. For ease of
exposition, we first summarize the results using the modified 
algorithm. 

C. Performance of the algorithm 

With constant power allocation, that is with $P_{(\ell)} = P/L$ for
each $\ell$, the decoder is shown to reliably achieve rates up to
a threshold rate $R_0 = (1/2)P/(P + \sigma^2)$, which is less than
capacity. This rate $R_0$ is seen to be close to the capacity
when the signal-to-noise ratio $\mathrm{snr} = P/\sigma^2$ is low. However, since it is
bounded by 1/2, it is substantially less than the capacity for
larger snr. To bring the rate higher, up to capacity, we use
variable power allocation with power

$$P_{(\ell)} \propto e^{-2C\ell/L}, \tag{3}$$

for sections $\ell$ from 1 to $L$.
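The gap between the threshold rate for constant power allocation and the capacity can be tabulated directly. The sketch assumes both quantities are measured in nats, which is the unit in which the threshold rate is close to capacity at low snr, and it writes both in terms of the ratio $P/\sigma^2$. The function names are illustrative.

```python
import math

def capacity_nats(snr):
    """Capacity (1/2) log(1 + snr), in nats per channel use."""
    return 0.5 * math.log(1.0 + snr)

def threshold_rate(snr):
    """R_0 = (1/2) P / (P + sigma^2), written in terms of snr = P / sigma^2."""
    return 0.5 * snr / (1.0 + snr)

for snr in (0.1, 1.0, 7.0, 100.0):
    print(snr, threshold_rate(snr) / capacity_nats(snr))
```

The printed ratio stays near 1 for small snr and falls off as snr grows, since the threshold rate never exceeds 1/2 while the capacity grows logarithmically.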

As we shall review, such power allocation also would arise
if one were attempting to successively decode one section at
a time, with the signal contributions of as yet un-decoded
sections treated as noise, in a way that splits the rate $C$ into
$L$ pieces each of size $C/L$; however, such decoding would
require the section sizes to be exponentially large to achieve
desired reliability. In contrast, in our adaptive scheme, many
of the sections are considered each step.

Fig. 2. Plot of the update function $g_L(x)$. The dots measure the proportion
of sections correctly detected after a particular number of steps. Here $M = 2^{16}$,
snr $= 7$, $R = 0.74$ and $L$ taken to be equal to $M$. The height reached
by the $g_L(x)$ curve at the final step corresponds to a 0.986 proportion of
sections correctly detected, and a failed detection rate target of 0.013. The
accumulated false alarm rate bound is 0.008. The probability of mistake rates
larger than these targets is bounded by $1.5 \times 10^{-3}$.

For rate near capacity, it is helpful to use a modified power
allocation, where

$$P_{(\ell)} \propto \max\{e^{-2C(\ell-1)/L},\, u\}, \tag{4}$$

with a non-negative value of $u$. However, since its analysis is
more involved we do not pursue this here. Interested readers
may refer to documents [?], [22] for a more thorough analysis
including this power allocation.

The analysis leads us to a function $g_L : [0, 1] \to [0, 1]$,
which depends on the power allocation and the various parameters
$L$, $M$, snr and $R$, that proves to be helpful in understanding
the performance of successive steps of the algorithm.
If $x$ is the previous success rate, then $g_L(x)$ quantifies the
expected success rate after the next step. An example of the
role of $g_L$ is shown in Fig. 2.

An outer Reed-Solomon code completes the task of identifying
the fraction of sections that have errors or erasures (see
section VI of Joseph and Barron [24] for details) so that we
end up with a small block error probability. If $R_{\mathrm{outer}} = 1 - \delta$
is the rate of an RS code, with $0 < \delta < 1$, then a section
mistake rate $\hat{\delta}_{\mathrm{mis}}$ less than $\delta_{\mathrm{mis}}$ can be corrected, provided
$\delta_{\mathrm{mis}} < \delta$. Further, if $R$ is the rate associated with our
inner (superposition) code, then the total rate after correcting
for the remaining mistakes is given by $R_{\mathrm{tot}} = R_{\mathrm{outer}} R$.
The end result, using our theory for the distribution of the
fraction of mistakes of the superposition code, is that the block
error probability is exponentially small. One may regard the
composite code as a superposition code in which the subsets
are forced to maintain at least a certain minimal separation, so
that decoding to within a certain distance from the true subset
implies exact decoding.

In the proposition below, we assume that the power allocation
is given by (3). Further, we assume that the threshold $\tau$
is of the form

$$\tau = \sqrt{2 \log M} + a, \tag{5}$$

with a positive $a$ specified in subsection VII-D.

We allow rate $R$ up to $C^*$, where $C^*$ can be written as

$$C^* = \frac{C}{1 + \mathrm{drop}^*}.$$

Here $\mathrm{drop}^*$ is a positive quantity given explicitly later in this
paper. It is near

$$\delta_M = \frac{1}{\sqrt{\pi \log M}}, \tag{6}$$

ignoring terms of smaller order.

Thus $C^*$ is within order $1/\sqrt{\log M}$ of capacity and tends
to $C$ for large $M$. With the modified power allocation (4), it
is shown in [?], [22] that one can make $C^*$ of order $1/\log M$
of capacity.

Proposition 1. For any inner code rate $R < C^*$, express it in
the form

$$R = \frac{C^*}{1 + \kappa/\log M}, \tag{7}$$

with $\kappa > 0$. Then, for the partitioned superposition code,

I) The adaptive successive decoder admits fraction of section
mistakes less than

$$\delta_{\mathrm{mis}} = \frac{3\kappa + 5}{8C \log M} + \frac{\delta_M}{2C}, \tag{8}$$

except in a set of probability not more than

$$p_e = \kappa_1 e^{-\kappa_2 L \min\{\kappa_3 (\Delta^*)^2,\ \kappa_4 \Delta^*\}},$$

where

$$\Delta^* = (C^* - R)/C^*.$$

Here $\kappa_1$ is a constant to be specified later that is only
polynomial in $M$. Also, $\kappa_2$, $\kappa_3$ and $\kappa_4$ are constants that
depend on the snr. See subsection VII-E for details.

II) After composition with an outer Reed-Solomon code the
decoder admits block error probability less than $p_e$, with
the composite rate being $R_{\mathrm{tot}} = (1 - \delta_{\mathrm{mis}}) R$.

The proof of the above proposition is given in subsection
VII-E. The following is an immediate consequence of the
above proposition.

Corollary 2. For any fixed total communication rate $R_{\mathrm{tot}} < C$,
there exists a dictionary $X$ with size $N = LM$ that is
polynomial in block length $n$, for which the sparse superposition
code, with adaptive successive decoding (and outer
Reed-Solomon code), admits block error probability $p_e$ that is
exponentially small in $L$. In particular,

$$\lim_{n \to \infty} \frac{1}{n} \log(1/p_e) \geq \mathrm{const} \min\{\Delta, \Delta^2\},$$

where here $\Delta = (C - R_{\mathrm{tot}})/C$ and const denotes a positive
constant.

The above corollary follows since if $\kappa$ is of order $\sqrt{\log M}$,
then $\Delta^*$ is of order $1/\sqrt{\log M}$. Further, as $C^*$ is of order
$1/\sqrt{\log M}$ below the capacity $C$, we also get that
$\Delta_{\mathrm{inner}} = (C - R)/C$ is also of order $1/\sqrt{\log M}$ below capacity. From the
expression for $\delta_{\mathrm{mis}}$ in (8), one sees that the same holds for
the total rate drop, that is $\Delta = (C - R_{\mathrm{tot}})/C$. A more rigorous
proof is given in subsection VII-E.
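To get a feel for the orders of magnitude in Proposition 1, the bound on the section mistake fraction can be evaluated numerically. The sketch below reads (6) as $\delta_M = 1/\sqrt{\pi \log M}$ and (8) as $\delta_{\mathrm{mis}} = (3\kappa + 5)/(8C \log M) + \delta_M/(2C)$, and takes the composite rate from part II as $(1 - \delta_{\mathrm{mis}})R$. Natural logarithms and capacity in nats are assumed, and the parameter values are hypothetical.

```python
import math

def delta_M(M):
    """delta_M = 1 / sqrt(pi * log M), natural logarithm assumed."""
    return 1.0 / math.sqrt(math.pi * math.log(M))

def mistake_bound(kappa, C, M):
    """delta_mis = (3*kappa + 5) / (8*C*log M) + delta_M / (2*C)."""
    return (3.0 * kappa + 5.0) / (8.0 * C * math.log(M)) + delta_M(M) / (2.0 * C)

def total_rate(R, kappa, C, M):
    """Composite rate (1 - delta_mis) * R after the outer Reed-Solomon code."""
    return (1.0 - mistake_bound(kappa, C, M)) * R

C = 0.5 * math.log(1.0 + 7.0)      # capacity in nats for snr = 7 (assumed unit)
for log2_M in (16, 64, 256):
    print(log2_M, mistake_bound(1.0, C, 2.0 ** log2_M))
```

The bound decreases slowly with $M$, in line with the $1/\sqrt{\log M}$ behavior described above.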

D. Comparison with Least Squares estimator 

Here we compare the rate achieved by our practical
decoder with what is achieved with the theoretically optimal,
but possibly impractical, least squares decoding of these sparse
superposition codes shown in the companion paper [24].

Let $\Delta = (C - R)/C$ be the rate drop from capacity, with $R$
not more than $C$. The rate drop $\Delta$ takes values between 0 and
1. With power allocated equally across sections, that is with
$P_{(\ell)} = P/L$, it was shown in [24] that for any $\delta_{\mathrm{mis}} \in [0, 1)$,
the probability of more than a fraction $\delta_{\mathrm{mis}}$ of mistakes, with
least squares decoding, is less than

$$\exp\{-n c_1 \min\{\Delta^2, \delta_{\mathrm{mis}}\}\},$$

for any positive rate drop $\Delta$ and any size $n$. The error exponent
for the above is correct, in terms of orders of magnitude, to the
theoretically best possible error probability for any decoding
scheme, as established by Shannon and Gallager, and reviewed
for instance in [29].

The bound obtained for the least squares decoder is better
than that obtained for our practical decoder in its freedom
of any choice of mistake fraction, rate drop and size of the
dictionary matrix $X$. Here, we allow for rate drop $\Delta$ to be
of order $1/\sqrt{\log M}$. Further, from the expression (8), we
have $\delta_{\mathrm{mis}}$ is of order $1/\sqrt{\log M}$, when $\kappa$ is taken to be
$O(\sqrt{\log M})$. Consequently, we compare the error exponents
obtained here with that of the least squares estimator of [24],
when both $\Delta$ and $\delta_{\mathrm{mis}}$ are of order $1/\sqrt{\log M}$.

Using the expression given above for the least squares
decoder one sees that the exponent is of order $n/(\log M)$,
or equivalently $L$, using $n = (L \log M)/R$. For our decoder,
the error probability bound is seen to be exponentially small in
$L/(\log M)$ using the expression given in Proposition 1. This
bound is within a $(\log M)$ factor of what we obtained for the
optimal least squares decoding of sparse superposition codes.

E. Related work in Coding 

We point out several directions of past work that connect to 
what is developed here. Modern day communication schemes, 
for example LDPC [19] and Turbo Codes [12], have been
demonstrated to have empirically good performance. However,
a mathematical theory for their reliability is restricted only to
certain special cases, for example erasure channels [27].

These LDPC and Turbo codes use message passing algorithms
for their decoding. Interestingly, there has been
recent work by Bayati and Montanari [11] that has extended
the use of these algorithms for estimation in the general
high-dimensional regression setup with Gaussian $X$ matrices.
Unlike our adaptive successive decoder, where we decide
whether or not to select a particular term in a step, these
iterative algorithms make soft decisions in each step. However,
analysis addressing rates of communication has not been
given in these works. Subsequent to the present work, an alternative
algorithm with soft-decision decoding for our partitioned
superposition codes is proposed and analyzed by Barron and
Cho [?].

A different approach to reliable and computationally-feasible
decoding, with restriction to binary signaling, is in the
work on channel polarization of [?], [?]. These polar codes
have been adapted to the Gaussian case as in [1]; however, the
error probability is exponentially small in $\sqrt{n}$, rather than $n$.

The ideas of superposition codes, rate splitting, and successive
decoding for Gaussian noise channels began with Cover
[15] in the context of multiple-user channels. There, each
section corresponds to the codebook for a particular user, and
what is sent is a sum of codewords, one from each user. Here
we are putting that idea to use for the original Shannon single-user
problem, with the difference that we allow the number of
sections to grow with blocklength $n$, allowing for manageable
dictionaries.

Another development on broadcast channels by Cover [15]
that we use is that, for such Gaussian channels, the power
allocation can be arranged as in (3) such that messages can be
peeled off one at a time by successive decoding. However, such
successive decoding applied to our setting would not result
in the exponentially small error probability that we seek for
manageable dictionaries. It is for this reason that instead of
selecting the terms one at a time, we select multiple terms in
a step adaptively, depending upon whether their correlation is
high or not.

A variant of our regression setup was proposed by Tropp 
[33] for communication in the single user setup. However, his
approach does not lead to communication at positive rates, as
discussed in the next subsection.

There have been recent works that have used our partitioned
coding setup for providing a practical solution to the Gaussian
source coding problem, as in Kontoyiannis et al. [25] and
Venkataramanan et al. [36]. A successive decoding algorithm
for this problem is being analyzed by Venkataramanan et al.
[35]. An intriguing aspect of the analysis in [35] is that the
source coding proceeds successively, without the need for
adaptation across multiple sections as needed here.

F. Relationships to sparse signal recovery 

Here we comment on the relationships to high-dimensional 
regression. A very common assumption is that the coefficient 
vector is sparse, meaning that it has only a few, in our case L, 
non-zeroes, with L typically much smaller than the dimension 
N. Note, unlike our communication setting, it is not assumed 
that the magnitude of the non-zeroes be known. Most relevant 
to our setting are works on support recovery, or the recovery of
the non-zeroes of $\beta$, when $\beta$ is typically allowed to belong to
a set with L non-zeroes, with the magnitude of the non-zeroes 
being at least a certain positive value. 



Popular techniques for such problems involve relaxation 
with an $\ell_1$-penalty on the coefficient vector, for example in the
basis pursuit [14] and Lasso [31] algorithms. An alternative is
to perform a smaller number of iterations, such as we do here,
aimed at determining the target subset. Such works on sparse
approximation and term selection concern a class of iterative
procedures which may be called relaxed greedy algorithms 
(including orthogonal matching pursuit or OMP) as studied in 
EH, 0, ED, E6), QUI, EE El, EOj, El- In essence, 
each step of these algorithms finds, for a given set of vectors, 
the one which maximizes the inner product with the residuals 
from the previous iteration and then uses it to update the linear 
combination. Our adaptive successive decoder is similar in 
spirit to these algorithms. 

Results on support recovery can broadly be divided into two 
categories. The first involves giving, for a given X matrix, 
uniform guarantees for support recovery. In other words, it 
guarantees, for any $\beta$ in the allowed set of coefficient vectors,
that the probability of recovery is high. The second category 
of research involves results where the probability of recovery 
is obtained after certain averaging, where the averaging is over 
a distribution of the X matrix. 

For the first approach, a common condition on the $X$ matrix
is the mutual incoherence condition, which assumes that the
correlation between any two distinct columns be small. In
particular, assuming that $\|X_j\|^2 = n$, for each $j = 1, \ldots, N$,
it is assumed that

$$\frac{1}{n} \max_{j \neq j'} |X_j^T X_{j'}| \text{ is } O(1/L). \tag{9}$$

Another related criterion is the irrepresentable criterion [32],
[?]. However, the above conditions are too stringent for our
purpose of communicating at rates up to capacity. Indeed, for
i.i.d. $N(0, 1)$ designs, $n$ needs to be $\Omega(L^2 \log M)$ for these
conditions to be satisfied. Here $n = \Omega(L^2 \log M)$ denotes
that $n \geq \mathrm{const}\, L^2 \log M$, for a positive const not depending
upon $L$ or $M$. In other words, the rate $R$ is of order $1/L$,
which goes to 0 for large $L$. Correspondingly, results from
these works cannot be directly applied to our communication
problem.
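The incoherence quantity in (9) is easy to compute for a small random design, which gives some intuition for why the condition is restrictive. The sketch below rescales each column to squared norm $n$, as assumed in the text, and takes the maximum over distinct pairs; the function name and the sizes are illustrative. For i.i.d. Gaussian columns one expects this quantity to be roughly of order $\sqrt{(\log N)/n}$, so forcing it below a multiple of $1/L$ pushes $n$ to grow like $L^2 \log N$.

```python
import math
import random

def coherence(cols):
    """(1/n) max over j != j' of |X_j^T X_j'| after scaling to ||X_j||^2 = n."""
    n = len(cols[0])
    scaled = []
    for x in cols:
        norm = math.sqrt(sum(v * v for v in x))
        scaled.append([v * math.sqrt(n) / norm for v in x])
    best = 0.0
    for a in range(len(scaled)):
        for b in range(a + 1, len(scaled)):
            ip = abs(sum(u * v for u, v in zip(scaled[a], scaled[b])))
            best = max(best, ip / n)
    return best

random.seed(1)
n, N = 200, 64                     # hypothetical small design
X = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
mu = coherence(X)
```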

As mentioned earlier, the idea of adapting techniques in 
compressed sensing to solve the communication problem 
began with Tropp [33]. However, since he used a condition
similar to the irrepresentable condition discussed above, his 
results do not demonstrate communication at positive rates. 

We also remark that conditions such as (9) are required
by algorithms such as Lasso and OMP for providing uniform
guarantees on support recovery. However, there are algorithms
which provide guarantees with much weaker conditions on
$X$. Examples include the iterative forward-backward algorithm
[40] and least squares minimization using concave penalties
[39]. Even though these results, when translated to our setting,
do imply communication at positive rates is possible, a
demonstration that rates up to capacity can be achieved has
been lacking.

The second approach, as discussed above, is to assign a
distribution for the $X$ matrix and analyze performance after
averaging over this distribution. Wainwright [38] considers $X$
matrices with rows i.i.d. $N(0, \Sigma)$, where $\Sigma$ satisfies certain
conditions, and shows that recovery is possible with the Lasso
with $n$ that is $\Omega(L \log M)$. In particular his results hold for
the i.i.d. Gaussian ensembles that we consider here. Analogous
results for the OMP were shown by Joseph [23]. Another result
in the same spirit of average case analysis is by Candes
and Plan [13] for the Lasso, where the authors assign a prior
distribution to $\beta$ and study the performance after averaging
over this distribution. The $X$ matrix is assumed to satisfy a
weaker form of the incoherence condition that holds with high
probability for i.i.d. Gaussian designs, with $n$ again of the right
order.

A caveat in these discussions is that the aim of much 
(though not all) of the work on sparse signal recovery, 
compressed sensing, and term selection in linear statistical 
models is distinct from the purpose of communication alone. 
In particular, rather than the non-zero coefficients being fixed
according to a particular power allocation, the aim is to allow a
class of coefficient vectors, such as that described above, and
still recover their support and estimate the coefficient values.
The main distinction is that our coefficient vectors
belong to a finite set, of $M^L$ elements, whereas in the above
literature the class of coefficient vectors is almost always
infinite. This additional flexibility is one of the reasons why
an exact characterization of achieved rate has not been done 
in these works. 

Another point of distinction is that the majority of these works
focus on exact recovery of the support of the true coefficient
vector $\beta$. As mentioned before, as our non-zeroes are quite
small (of the order of $1/\sqrt{L}$), one cannot get exponentially
small error probabilities for exact support recovery. Correspondingly,
it is essential to relax the stipulation of exact
support recovery and allow for a certain small fraction of
mistakes (both false alarms and failed detection). To the best
of our knowledge, there is still a need in the sparse signal
recovery literature to provide proper controls on these mistake
rates to get significantly lower error probabilities.

Section II describes our adaptive successive decoder in
full detail. Section III describes the computational resource
required for the algorithm. Section IV presents the tools for the
theoretical analysis of the algorithm, while section V presents
the theorem for reliability of the algorithm. Computational
illustrations are included in section VI. Section VII proves
results for the function $g_L$ of figure 2 required for
demonstrating that one can indeed achieve rates up to capacity.
The appendix collects some auxiliary matters.

II. The Decoder

The algorithm we analyze is a modification of the algorithm
described in subsection I-B. The main reason for the modification
is the difficulty in analyzing the statistics $Z^{\mathrm{res}}_{k,j}$,
for $j \in J_k$ and for steps $k \geq 2$.

The distribution of the statistic $Z_{1,j}$, used in the first step,
is easy to characterize, as will be seen below. This is because
the random variables

$$\{X_j,\ j \in J\} \text{ and } Y$$

are jointly multivariate normal. However, this fails to hold for
the random variables

$$\{X_j,\ j \in J_k\} \text{ and } R_k$$

used in forming $Z^{\mathrm{res}}_{k,j}$.

It is not hard to see why this joint Gaussianity fails. Recall
that $R_k$ may be expressed as

$$R_k = Y - \sum_{j \in \mathrm{dec}_{1,k-1}} \sqrt{P_j}\, X_j,$$

where $\mathrm{dec}_{1,k-1} = \mathrm{dec}_1 \cup \ldots \cup \mathrm{dec}_{k-1}$ denotes the set of terms
decoded in the first $k - 1$ steps.
Correspondingly, since the event $\mathrm{dec}_{1,k-1}$ is not independent
of the $X_j$'s, the quantities $R_k$, for $k \geq 2$, are no longer normal
random vectors. It is for this reason that we introduce the
following two modifications.

A. The first modification: Using a combined statistic

We overcome the above difficulty in the following manner.
Recall that each

$$R_k = Y - F_1 - \ldots - F_{k-1} \tag{10}$$

is a sum of $Y$ and $-F_1, \ldots, -F_{k-1}$. Let $G_1 = Y$ and denote
$G_k$, for $k \geq 2$, as the part of $-F_{k-1}$ that is orthogonal to
the previous $G_k$'s. In other words, perform Gram-Schmidt
orthogonalization on the vectors $Y, -F_1, \ldots, -F_{k-1}$, to get
$G_{k'}$ with $k' = 1, \ldots, k$. Then, from (10),

$$\frac{R_k}{\|R_k\|} = \mathrm{weight}_1 \frac{G_1}{\|G_1\|} + \mathrm{weight}_2 \frac{G_2}{\|G_2\|} + \ldots + \mathrm{weight}_k \frac{G_k}{\|G_k\|},$$

for some weights, denoted by $\mathrm{weight}_{k'} = \mathrm{weight}_{k',k}$, for
$k' = 1, \ldots, k$. More specifically,

$$\mathrm{weight}_{k'} = \frac{R_k^T G_{k'}}{\|R_k\| \|G_{k'}\|},$$

and

$$\mathrm{weight}_1^2 + \ldots + \mathrm{weight}_k^2 = 1.$$

Correspondingly, the statistic $Z^{\mathrm{res}}_{k,j} = X_j^T R_k/\|R_k\|$, which we
want to use for $k$th step detection, may be expressed as

$$Z^{\mathrm{res}}_{k,j} = \mathrm{weight}_1 Z_{1,j} + \mathrm{weight}_2 Z_{2,j} + \ldots + \mathrm{weight}_k Z_{k,j},$$

where

$$Z_{k,j} = X_j^T G_k/\|G_k\|. \tag{11}$$
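The decomposition of the residual statistic into orthogonal components can be checked numerically. The sketch below uses random vectors as stand-ins for the fits $F_1$ and $F_2$ (in the decoder they would be sums of selected columns), performs Gram-Schmidt on $Y, -F_1, -F_2$, and verifies that the weights have unit sum of squares and that the weighted sum of the component statistics reproduces $X_j^T R_3/\|R_3\|$. All names and sizes are illustrative.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthogonalize vectors in order; returns G_1, G_2, ... (not normalized)."""
    basis = []
    for v in vectors:
        g = list(v)
        for b in basis:
            coef = dot(g, b) / dot(b, b)
            g = [gi - coef * bi for gi, bi in zip(g, b)]
        basis.append(g)
    return basis

random.seed(2)
n = 50
Y = [random.gauss(0.0, 1.0) for _ in range(n)]
F1 = [random.gauss(0.0, 1.0) for _ in range(n)]   # stand-ins for the fits
F2 = [random.gauss(0.0, 1.0) for _ in range(n)]
Xj = [random.gauss(0.0, 1.0) for _ in range(n)]

G = gram_schmidt([Y, [-f for f in F1], [-f for f in F2]])
R3 = [y - a - b for y, a, b in zip(Y, F1, F2)]    # R_3 = Y - F_1 - F_2
norm_R = math.sqrt(dot(R3, R3))
weights = [dot(R3, g) / (norm_R * math.sqrt(dot(g, g))) for g in G]
Z = [dot(Xj, g) / math.sqrt(dot(g, g)) for g in G]

sum_sq = sum(w * w for w in weights)              # should be close to 1
z_res = dot(Xj, R3) / norm_R
z_from_weights = sum(w * z for w, z in zip(weights, Z))
```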



Instead of using the statistic $Z^{\mathrm{res}}_{k,j}$, for $k \geq 2$, we find it more
convenient to use statistics of the form

$$Z^{\mathrm{comb}}_{k,j} = \lambda_{1,k} Z_{1,j} + \lambda_{2,k} Z_{2,j} + \ldots + \lambda_{k,k} Z_{k,j}, \tag{12}$$

where $\lambda_{k',k}$, for $k' = 1, \ldots, k$, are positive weights satisfying

$$\sum_{k'=1}^{k} \lambda_{k',k}^2 = 1.$$

For convenience, unless there is some ambiguity, we suppress
the dependence on $k$ and denote $\lambda_{k',k}$ as simply $\lambda_{k'}$.
Essentially, we choose $\lambda_1$ so that it is a deterministic proxy for
$\mathrm{weight}_1$ given above. Similarly, $\lambda_{k'}$ is a proxy for $\mathrm{weight}_{k'}$
for $k' \geq 2$. The important modification we make, of replacing
the random $\mathrm{weight}_k$'s by proxy weights, enables us to give
an explicit characterization of the distribution of the statistic
$Z^{\mathrm{comb}}_{k,j}$, which we use as a proxy for $Z^{\mathrm{res}}_{k,j}$ for detection of
additional terms in successive iterations.

We now describe the algorithm after incorporating the above modification. For the time being assume that for each $k$ we have a vector of deterministic weights,

$$(\lambda_{k',k} : k' = 1, \dots, k),$$

satisfying $\sum_{k'=1}^{k} \lambda_{k'}^2 = 1$, where recall that for convenience we denote $\lambda_{k',k}$ as $\lambda_{k'}$. Recall $G_1 = Y$.

For step $k = 1$, do the following

• For $j \in J$, compute

$$\mathcal{Z}_{1,j} = X_j^T G_1/\|G_1\|.$$

To provide consistency with the notation used below, we also denote $\mathcal{Z}_{1,j}$ as $\mathcal{Z}^{comb}_{1,j}$.

• Update

$$dec_1 = \{j \in J : \mathcal{Z}^{comb}_{1,j} \ge \tau\}, \qquad (13)$$

which corresponds to the set of decoded terms for the first step. Also let $dec_{1,1} = dec_1$. Update

$$F_1 = \sum_{j \in dec_1} \sqrt{P_j}\, X_j.$$

This completes the actions of the first step. Next, perform the
following steps for $k \ge 2$, with the number of steps $k$ to be
at most a pre-defined value $m$.

• Define $G_k$ as the part of $-F_{k-1}$ orthogonal to $G_1, \dots, G_{k-1}$.

• For $j \in J_k = J - dec_{1,k-1}$, calculate

$$\mathcal{Z}_{k,j} = X_j^T G_k/\|G_k\|. \qquad (14)$$

• For $j \in J_k$, compute the combined statistic using the above $\mathcal{Z}_{k,j}$ and $\mathcal{Z}_{k',j}$, $1 \le k' \le k-1$, given by,

$$\mathcal{Z}^{comb}_{k,j} = \lambda_1 \mathcal{Z}_{1,j} + \lambda_2 \mathcal{Z}_{2,j} + \dots + \lambda_k \mathcal{Z}_{k,j},$$

where the weights $\lambda_{k'} = \lambda_{k',k}$, which we specify later, are positive and have sum of squares equal to 1.

• Update

$$dec_k = \{j \in J_k : \mathcal{Z}^{comb}_{k,j} \ge \tau\}, \qquad (15)$$

which corresponds to the set of decoded terms for the $k$th step. Also let $dec_{1,k} = dec_{1,k-1} \cup dec_k$, which is the set of terms detected after $k$ steps.

• This completes the $k$th step. Stop if either $L$ terms have been decoded, or if no terms are above threshold, or if $k = m$. Otherwise increase $k$ by 1 and repeat.
As mentioned earlier, part of what makes the above work is our ability to assign deterministic weights $(\lambda_{k',k} : k' = 1, \dots, k)$, for each step $k = 1, \dots, m$. To be able to do so, we need good control on the (weighted) sizes of the set of decoded terms $dec_{1,k}$ after step $k$, for each $k$. In particular, defining for each $j$ the quantity $\pi_j = P_j/P$, we define the size of the set $dec_{1,k}$ as $size_{1,k}$, where

$$size_{1,k} = \sum_{j \in dec_{1,k}} \pi_j. \qquad (16)$$

Notice that $size_{1,k}$ is increasing in $k$, and is a random quantity which depends on the number of correct detections and false alarms in each step. As we shall see, we need to provide good upper and lower bounds for $size_{1,1}, \dots, size_{1,k-1}$ that are satisfied with high probability, to be able to provide deterministic weights of combination, $\lambda_{k',k}$, for $k' = 1, \dots, k$, for the $k$th step.

It turns out that the existing algorithm does not provide the
means to give good controls on the $size_{1,k}$'s. To be able to
do so, we need to further modify our algorithm.

B. The second modification: Pacing the steps 

As mentioned above, we need to get good controls on the
quantity $size_{1,k}$, for each $k$, where $size_{1,k}$ is defined as above.
For this we modify the algorithm even further.

Assume that we have certain pre-specified positive values, which we call $q_{1,k}$, for $k = 1, \dots, m$. Explicit expressions for $q_{1,k}$, which are taken to be strictly increasing in $k$, will be specified later on. The weights of combination,

$$(\lambda_{k',k} : k' = 1, \dots, k),$$

for $k = 1, \dots, m$, will be functions of these values.

For each $k$, denote

$$thresh_k = \{j \in J_k : \mathcal{Z}^{comb}_{k,j} \ge \tau\}.$$

For the algorithm described in the previous subsection, $dec_k$, the set of decoded terms for the $k$th step, was taken to be equal to $thresh_k$. We make the following modification:

For each $k$, instead of making $dec_k$ equal to $thresh_k$, take $dec_k$ to be a subset of $thresh_k$ so that the total size of the decoded set after $k$ steps, given by $size_{1,k}$, is near $q_{1,k}$. The set $dec_k$ is chosen by selecting terms in $thresh_k$, in decreasing order of their $\mathcal{Z}^{comb}_{k,j}$ values, until $size_{1,k}$ nearly equals $q_{1,k}$.

In particular, given $size_{1,k-1}$, one continues to add terms in $dec_k$, if possible, until

$$q_{1,k} - 1/L_\pi < size_{1,k} \le q_{1,k}. \qquad (17)$$

Here $1/L_\pi = \min_\ell \pi_{(\ell)}$ is the minimum non-zero weight over all sections. It is a small term of order $1/L$ for the power allocations we consider.
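The pacing rule can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `paced_selection` and its arguments are hypothetical, the example uses equal weights $\pi_j = 1/L$ (so that $1/L_\pi = 1/L$), and the small tolerance guards against floating-point round-off.

```python
def paced_selection(z_comb, pi, thresh_idx, size_prev, q_target):
    """Pick dec_k from thresh_k in decreasing order of the combined statistic.

    Terms are added while the weighted size stays at or below q_target,
    which mirrors the upper bound in (17).
    """
    order = sorted(thresh_idx, key=lambda j: z_comb[j], reverse=True)
    dec_k, size = [], size_prev
    for j in order:
        if size + pi[j] > q_target + 1e-12:   # adding j would overshoot q_{1,k}
            break
        dec_k.append(j)
        size += pi[j]
    return dec_k, size

# Toy example: L = 10 sections with equal weights, five terms above threshold.
L = 10
pi = [1.0 / L] * L
z_comb = [0.0, 2.0, 0.0, 5.0, 3.0, 0.0, 0.0, 4.5, 0.0, 3.5]
thresh_k = [1, 3, 4, 7, 9]
dec_k, size_1k = paced_selection(z_comb, pi, thresh_k, size_prev=0.2, q_target=0.45)
print(dec_k, size_1k)
```

In this toy run the two terms with the largest combined statistics are accepted, and the resulting size lies in the window $(q_{1,k} - 1/L, \, q_{1,k}]$.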

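Of course the set of terms $thresh_k$ might not be large enough to arrange for $dec_k$ satisfying (17). Nevertheless, it is satisfied, provided

$$size_{1,k-1} + \sum_{j \in thresh_k} \pi_j \ge q_{1,k},$$

or equivalently,

$$\sum_{j \in dec_{1,k-1}} \pi_j + \sum_{j \in J - dec_{1,k-1}} \pi_j 1_{\{\mathcal{Z}^{comb}_{k,j} \ge \tau\}} \ge q_{1,k}. \qquad (18)$$

Here we use the fact that $J_k = J - dec_{1,k-1}$.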

Our analysis demonstrates that we can arrange for an increasing sequence of $q_{1,k}$, with $q_{1,m}$ near 1, such that condition (18) is satisfied for $k = 1, \dots, m$, with high probability. Correspondingly, $size_{1,k}$ is near $q_{1,k}$ for each $k$ with high probability. In particular, $size_{1,m}$, the weighted size of the decoded set after the final step, is near $q_{1,m}$, which is near 1.

We remark that in [7], an alternative technique for analyzing the distributions of $\mathcal{Z}^{comb}_{k,j}$, for $j \in J_k$, is pursued, which does away with the above approach of pacing the steps. The technique in [7] provides uniform bounds on the performance for a collection of random variables indexed by the vectors of weights of combination. However, since the pacing approach leads to cleaner analysis, we pursue it here.



III. Computational resource 

For the decoder described in section II, the vectors $G_k$ can be computed efficiently using the Gram-Schmidt procedure. Further, as will be seen, the weights of combination are chosen so that, for each $k$,

$$\mathcal{Z}^{comb}_{k,j} = \sqrt{1 - \lambda_{k,k}^2}\; \mathcal{Z}^{comb}_{k-1,j} + \lambda_{k,k} \mathcal{Z}_{k,j}.$$

This allows us to compute the statistic $\mathcal{Z}^{comb}_{k,j}$ easily from the previous combined statistic. Correspondingly, for simplicity we describe here the computational time of the algorithm in subsection I-B, in which one works with the residuals and accepts each term above threshold. Similar results hold for the decoder in section II.
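The recursive update can be checked numerically. The sketch below assumes weights of the form $\lambda_{k',k} = \sqrt{w_{k'}/(w_1 + \dots + w_k)}$ for positive increments $w_{k'}$, which is the form used in the separation analysis of section IV; the particular values of `w` and the random statistics `Z` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
k, N = 4, 6
w = np.array([1.0, 0.4, 0.3, 0.2])      # hypothetical positive increments w_1..w_k
Z = rng.standard_normal((k, N))         # stand-ins for the per-step statistics

def lam(kk):
    """Weights lambda_{k',kk} = sqrt(w_{k'} / (w_1 + ... + w_kk)), k' = 1..kk."""
    return np.sqrt(w[:kk] / w[:kk].sum())

# Direct evaluation: sum over k' of lambda_{k',k} * Z_{k',j}
direct = lam(k) @ Z

# Recursive evaluation: only the newest statistic is needed at each step
comb = Z[0].copy()                      # the step-1 combined statistic is Z_{1,j}
for kk in range(2, k + 1):
    l_kk = lam(kk)[-1]                  # lambda_{kk,kk}
    comb = np.sqrt(1.0 - l_kk**2) * comb + l_kk * Z[kk - 1]

print(np.max(np.abs(comb - direct)))    # discrepancy between the two evaluations
```

The recursion works for this weight family because $\sqrt{1 - \lambda_{k,k}^2}\, \lambda_{k',k-1} = \lambda_{k',k}$ for $k' < k$, so each step costs one scaled vector addition rather than a fresh $k$-term combination.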

The inner products require order $nLM$ multiply and adds each step, yielding a total computation of order $nLMm$ for $m$ steps. As we shall see, the ideal number of steps $m$ according to our bounds is of order $\log M$.

When there is a stream of strings $Y$ arriving in succession at the decoder, it is natural to organize the computations in a parallel and pipelined fashion as follows. One allocates $m$ signal processing chips, each configured nearly identically, to do the inner products. One such chip does the inner products with $Y$, a second chip does the inner products with the residuals from the preceding received string, and so on, up to chip $m$ which is working on the final decoding step from the string received several steps before. After an initial delay of $m$ received strings, all $m$ chips are working simultaneously.

If each of the signal processing chips keeps a local copy of the dictionary $X$, alleviating the challenge of numerous simultaneous memory calls, the total computational space (memory positions) involved in the decoder is $nLMm$, along with space for $LMm$ multiplier-accumulators, to achieve constant order computation time per received symbol. Naturally, there is the alternative of increased computation time with less space; indeed, decoding by serial computation would have runtime of order $nLMm$. Substituting $L = nR/\log M$ and $m$ of order $\log M$, we may reexpress $nLMm$ as $n^2 M$. This is the total computational resource required (either space or time) for the sparse superposition decoder.

IV. Analysis 

Recall that we need to give controls on the random quantity $\hat\delta_{mis}$ given by (2). Our analysis leads to controls on the following weighted measures of correct detections and false alarms for a step. Let $\pi_j = P_j/P$, where recall that $P_j = P_{(\ell)}$ for any $j$ in section $\ell$. The $\pi_j$ sums to 1 across $j$ in sent, and sums to $M-1$ across $j$ in other. Define in general

$$\hat q_k = \sum_{j \in sent \cap dec_k} \pi_j, \qquad (19)$$

which provides a weighted measure for the number of correct detections in step $k$, and

$$\hat f_k = \sum_{j \in other \cap dec_k} \pi_j \qquad (20)$$

for the false alarms in step $k$. Bounds on $\hat\delta_{mis}$ can be obtained from the quantities $\hat q_k$ and $\hat f_k$ as we now describe. Denote

$$\hat\delta_{wght} = \Big(1 - \sum_{k=1}^{m} \hat q_k\Big) + \sum_{k=1}^{m} \hat f_k. \qquad (21)$$

An equivalent way of expressing $\hat\delta_{wght}$ is the sum of $\ell$ from 1 to $L$ of,

$$\sum_{j \in \text{section } \ell} \Big[ \pi_j 1_{\{j \in sent \cap dec^c_{1,m}\}} + \pi_j 1_{\{j \in other \cap dec_{1,m}\}} \Big].$$

In the equal power allocation case, where $\pi_j = 1/L$, one has $\hat\delta_{mis} \le \hat\delta_{wght}$. This can be seen by examining the contribution from a section to $\hat\delta_{mis}$ and $\hat\delta_{wght}$. We consider the three possible cases. In the case when the section has neither an error nor an erasure, its contribution to both $\hat\delta_{mis}$ and $\hat\delta_{wght}$ would be zero. Next, when a section has an error, its contribution would be $2/L$ to $\hat\delta_{mis}$ from (2). Its contribution to $\hat\delta_{wght}$ would also be $2/L$, with a $1/L$ contribution from the correct term (since it is in $sent \cap dec^c_{1,m}$), and another $1/L$ from the wrong term. Lastly, for a section with an erasure, its contribution to $\hat\delta_{mis}$ would be $1/L$, while its contribution to $\hat\delta_{wght}$ would be at least $1/L$. The contribution in the latter case would be greater than $1/L$ if there are multiple terms in other that were selected.
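The three cases can be tallied mechanically. The following Python sketch is illustrative only: it assumes, as the case analysis above indicates, that a section with exactly one decoded term that is wrong counts as an error (contribution $2/L$ to the mistake rate) and a section with zero or several decoded terms counts as an erasure (contribution $1/L$). The function name and the toy inputs are hypothetical.

```python
def mistake_rates(sent, decoded):
    """Return (delta_mis, delta_wght) for equal weights pi_j = 1/L.

    sent[l] is the index of the true term in section l;
    decoded[l] is the set of indices selected in section l.
    """
    L = len(sent)
    d_mis = d_wght = 0.0
    for l in range(L):
        sel = decoded[l]
        missed = sent[l] not in sel              # failed detection of the sent term
        false_alarms = len(sel - {sent[l]})      # selected terms from 'other'
        d_wght += (int(missed) + false_alarms) / L
        if len(sel) == 1 and missed:
            d_mis += 2 / L                       # error: a single wrong term
        elif len(sel) != 1:
            d_mis += 1 / L                       # erasure: none or several terms
    return d_mis, d_wght

# Four sections: correct, error, erasure (nothing), erasure (three terms selected).
sent = [0, 1, 2, 3]
decoded = [{0}, {2}, set(), {3, 1, 0}]
d_mis, d_wght = mistake_rates(sent, decoded)
print(d_mis, d_wght)
```

The last section shows the strict case: the sent term is found, yet two false alarms make the weighted contribution exceed the erasure contribution.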

For the power allocation (3) that we consider, bounds on $\hat\delta_{mis}$ are obtained by multiplying $\hat\delta_{wght}$ by the factor $snr/(2C)$. To see this, notice that for a given weighted fraction, the maximum possible un-weighted fraction would be if we assume that all the failed detections or false alarms came from the section with the smallest weight. This would correspond to the section with weight $\pi_{(L)}$, where it is seen that $\pi_{(L)} = 2C/(L\, snr)$. Accordingly, if $\delta_{wght}$ were an upper bound on $\hat\delta_{wght}$ that is satisfied with high probability, we take

$$\delta_{mis} = \frac{snr}{2C}\, \delta_{wght}, \qquad (22)$$

so that $\hat\delta_{mis} \le \delta_{mis}$ with high probability as well.

Next, we characterize, for $k \ge 1$, the distribution of $\mathcal{Z}_{k,j}$, for $j \in J_k$. As we mentioned earlier, the distribution of $\mathcal{Z}_{1,j}$ is easy to characterize. Correspondingly, we do this separately in the next subsection. In subsection IV-B we provide the analysis for the distribution of $\mathcal{Z}^{comb}_{k,j}$, for $k \ge 2$.



A. Analysis of the first step 

In Lemma 3 below we derive the distributional properties of $(\mathcal{Z}_{1,j} : j \in J)$. Lemma 4, in the next subsection, characterizes the distribution of $(\mathcal{Z}_{k,j} : j \in J_k)$ for steps $k \ge 2$.

Define

$$C_{j,R} = n\, \pi_j\, \nu, \qquad (23)$$

where $\nu = P/(P + \sigma^2)$. For the constant power allocation case, $\pi_j$ equals $1/L$. In this case $C_{j,R} = (R_0/R)\, 2 \log M$ is the same for all $j$.

For the power allocation (3), we have

$$\pi_j = e^{-2C(\ell-1)/L}\, (1 - e^{-2C/L})/(1 - e^{-2C}),$$

for each $j$ in section $\ell$. Let

$$\tilde C = (L/2)\,[1 - e^{-2C/L}], \qquad (24)$$

which is essentially identical to $C$ when $L$ is large. Then for $j$ in section $\ell$, we have

$$C_{j,R} = (\tilde C/R)\, e^{-2C(\ell-1)/L}\, (2 \log M). \qquad (25)$$
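The agreement between the definition (23) and the closed form (25) can be verified numerically. The sketch below is a check under stated assumptions: rates and capacity are measured in nats, $C = \frac{1}{2}\log(1 + snr)$, the block length satisfies $nR = L \log M$, and the numerical values of `snr`, `L`, `M` and `R` are arbitrary illustrative choices.

```python
import math

def pi_weights(L, C):
    """pi_(l), l = 1..L, for the exponentially decaying allocation (C in nats)."""
    scale = (1 - math.exp(-2 * C / L)) / (1 - math.exp(-2 * C))
    return [math.exp(-2 * C * (l - 1) / L) * scale for l in range(1, L + 1)]

def c_jr_direct(n, pi_l, snr):
    """C_{j,R} = n * pi_j * nu, with nu = P/(P + sigma^2) = snr/(1 + snr)."""
    return n * pi_l * snr / (1 + snr)

def c_jr_closed(L, M, R, C, l):
    """Closed form (25) with C-tilde from (24)."""
    c_tilde = (L / 2) * (1 - math.exp(-2 * C / L))
    return (c_tilde / R) * math.exp(-2 * C * (l - 1) / L) * 2 * math.log(M)

snr, L, M = 7.0, 64, 256                # illustrative values only
C = 0.5 * math.log(1 + snr)             # capacity in nats
R = 0.5 * C                             # a rate below capacity
n = L * math.log(M) / R                 # from n R = L log M
pis = pi_weights(L, C)

print(sum(pis))                                         # weights over sections
print(c_jr_direct(n, pis[0], snr), c_jr_closed(L, M, R, C, 1))
print(c_jr_direct(n, pis[-1], snr), c_jr_closed(L, M, R, C, L))
```

The identity rests on $1 - e^{-2C} = snr/(1 + snr) = \nu$, which cancels the normalizing denominator of $\pi_j$.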

We are now in a position to give the lemma for the distribution of $\mathcal{Z}_{1,j}$, for $j \in J$. The lemma below shows that each $\mathcal{Z}_{1,j}$ is distributed as a shifted normal, where the shift is approximately equal to $\sqrt{C_{j,R}}$ for any $j$ in sent, and is zero for $j$ in other. Accordingly, for a particular section, the maximum of the $\mathcal{Z}_{1,j}$, for $j \in$ other, is seen to be approximately $\sqrt{2 \log M}$, since it is the maximum of $M - 1$ independent standard normal random variables. Consequently, one would like $\sqrt{C_{j,R}}$ to be at least $\sqrt{2 \log M}$ for the correct term in that section to be detected.

Lemma 3. For each $j \in J$, the statistic $\mathcal{Z}_{1,j}$ can be represented as

$$\sqrt{C_{j,R}}\, (X_n/\sqrt{n})\, 1_{\{j\ sent\}} + Z_{1,j},$$

where $Z_1 = (Z_{1,j} : j \in J_1)$ is multivariate normal $N(0, \Sigma_1)$, with $\Sigma_1 = I - \delta_1 \delta_1^T/P$, where $\delta_1 = \sqrt{\nu}\, \beta$. Also,

$$X_n^2 = \frac{\|Y\|^2}{\sigma_Y^2}$$

is a Chi-square $n$ random variable that is independent of $Z_1 = (Z_{1,j} : j \in J)$. Here $\sigma_Y = \sqrt{P + \sigma^2}$ is the standard deviation of each coordinate of $Y$.

Proof: Recall that the $X_j$, for $j$ in $J$, are independent $N(0, I)$ random vectors and that $Y = \sum_j \beta_j X_j + \varepsilon$, where the sum of squares of the $\beta_j$ is equal to $P$.

The conditional distribution of each $X_j$ given $Y$ may be expressed as,

$$X_j = \beta_j Y/\sigma_Y^2 + U_j, \qquad (26)$$

where $U_j$ is a vector in $\mathbb{R}^n$ having a multivariate normal distribution. Denote $b = \beta/\sigma_Y$. It is seen that

$$U_j \sim N_n\big(0, (1 - b_j^2) I\big),$$

where $b_j$ is the $j$th coordinate of $b$.

Further, letting $U = [U_1 : \dots : U_N]$, it follows from the fact that the rows of $[X : \varepsilon/\sigma]$ are i.i.d. that the rows of the matrix $U$ are i.i.d.

Further, for row $i$ of $U$, the random variables $U_{i,j}$ and $U_{i,j'}$ have mean zero and expected product

$$1_{\{j = j'\}} - b_j b_{j'}.$$

In general, the covariance matrix of the $i$th row of $U$ is given by $\Sigma_1$.

For any constant vector $\alpha \ne 0$, consider $U_j^T \alpha/\|\alpha\|$. Its joint normal distribution across terms $j$ is the same for any such $\alpha$. Specifically, it is a normal $N(0, \Sigma_1)$, with mean zero and the indicated covariances.

Likewise define $Z_{1,j} = U_j^T Y/\|Y\|$. Conditional on $Y$, one has that jointly across $j$, these $Z_{1,j}$ have the normal $N(0, \Sigma_1)$ distribution. Correspondingly, $Z_1 = (Z_{1,j} : j \in J)$ is independent of $Y$, and has a $N(0, \Sigma_1)$ distribution unconditionally.

Where this gets us is revealed via the representation of the inner product $\mathcal{Z}_{1,j} = X_j^T Y/\|Y\|$, which using (26), is given by,

$$\mathcal{Z}_{1,j} = \beta_j \frac{\|Y\|}{\sigma_Y^2} + Z_{1,j}.$$

The proof is completed by noticing that for $j \in$ sent, one has $\sqrt{C_{j,R}} = \beta_j \sqrt{n}/\sigma_Y$. ■

B. Analysis of steps $k \ge 2$

We need to characterize the distribution of the statistic $\mathcal{Z}^{comb}_{k,j}$, $j \in J_k$, used in decoding additional terms for the $k$th step.

The statistic $\mathcal{Z}^{comb}_{k,j}$, $j \in J_k$, can be expressed more clearly in the following manner. For each $k \ge 1$, denote,

$$\mathcal{Z}_k = X^T \frac{G_k}{\|G_k\|}.$$

Further, define

$$\mathcal{Z}_{1,k} = [\mathcal{Z}_1 : \mathcal{Z}_2 : \dots : \mathcal{Z}_k]$$

and let $\Lambda_k = (\lambda_{1,k}, \lambda_{2,k}, \dots, \lambda_{k,k})^T$ be the deterministic vector of weights of combinations used for the statistics $\mathcal{Z}^{comb}_{k,j}$. Then $\mathcal{Z}^{comb}_{k,j}$ is simply the $j$th element of the vector

$$\mathcal{Z}^{comb}_k = \mathcal{Z}_{1,k} \Lambda_k.$$

We remind the reader that for step $k$ we are only interested in elements $j \in J_k$, that is, those that were not decoded in previous steps.

Below we characterize the distribution of $\mathcal{Z}^{comb}_k$ conditioned on what occurred on previous steps in the algorithm. More explicitly, we define $\mathcal{F}_{k-1}$ as

$$\mathcal{F}_{k-1} = (G_1, G_2, \dots, G_{k-1}, \mathcal{Z}_1, \dots, \mathcal{Z}_{k-1}), \qquad (27)$$

or the associated $\sigma$-field of random variables. This represents the variables computed up to step $k - 1$. Notice that from the knowledge of $\mathcal{Z}_{k'}$, for $k' = 1, \dots, k-1$, one can compute $\mathcal{Z}^{comb}_{k'}$, for $k' < k$. Correspondingly, the set of decoded terms $dec_{k'}$, till step $k - 1$, is completely specified from knowledge of $\mathcal{F}_{k-1}$.

Next, note that in $\mathcal{Z}_{1,k}$, only the vector $\mathcal{Z}_k$ does not belong to $\mathcal{F}_{k-1}$. Correspondingly, the conditional distribution of $\mathcal{Z}^{comb}_k$ given $\mathcal{F}_{k-1}$ is described completely by finding the distribution of $\mathcal{Z}_k$ given $\mathcal{F}_{k-1}$. Accordingly, we only need to characterize the conditional distribution of $\mathcal{Z}_k$ given $\mathcal{F}_{k-1}$.

Initializing with the distribution of $\mathcal{Z}_1$ derived in Lemma 3, we provide the conditional distributions of

$$\mathcal{Z}_{k,J_k} = (\mathcal{Z}_{k,j} : j \in J_k),$$

for $k = 2, \dots, n$. As in the first step, we show that the distribution of $\mathcal{Z}_{k,J_k}$ can be expressed as the sum of a mean vector and a multivariate normal noise vector $Z_{k,J_k} = (Z_{k,j} : j \in J_k)$. The algorithm will be arranged to stop long before $n$, so we will only need these up to some much smaller final $k = m$. Note that $J_k$ is never empty because we decode at most $L$, so there must always be at least $(M-1)L$ remaining.

The following measure of correct detections in a step, adjusted for false alarms, plays an important role in characterizing the distributions of the statistics involved in an iteration. Denote

$$\hat q^{adj}_k = \frac{\hat q_k}{1 + \hat f_k/\hat q_k}, \qquad (28)$$

where $\hat q_k$ and $\hat f_k$ are given by (19) and (20).

In the lemma below we denote $N_{J_k}(0, \Sigma)$ to be the multivariate normal distribution with dimension $|J_k|$, having mean zero and covariance matrix $\Sigma$, where $\Sigma$ is an $|J_k| \times |J_k|$ dimensional matrix. Further, we denote $\beta_{J_k}$ to be the sub-vector of $\beta$ consisting of terms with indices in $J_k$.

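Lemma 4. For each $k \ge 2$, the conditional distribution of $\mathcal{Z}_{k,j}$, for $j \in J_k$, given $\mathcal{F}_{k-1}$ has the representation

$$\sqrt{\hat w_k\, C_{j,R}}\; (X_{d_k}/\sqrt{n})\, 1_{\{j \in sent\}} + Z_{k,j}. \qquad (29)$$

Recall that $C_{j,R} = n \pi_j \nu$. Further, $\hat w_k = \hat s_k - \hat s_{k-1}$, which are increments of a series with total

$$\hat w_1 + \hat w_2 + \dots + \hat w_k = \hat s_k = \frac{1}{1 - \hat q^{adj,tot}_{k-1}\, \nu},$$

where

$$\hat q^{adj,tot}_k = \hat q^{adj}_1 + \dots + \hat q^{adj}_k. \qquad (30)$$

Here $\hat w_1 = \hat s_1 = 1$. The quantities $\hat q^{adj}_k$ are given by (28).

The conditional distribution $P_{Z_{k,J_k} | \mathcal{F}_{k-1}}$ is normal $N_{J_k}(0, \Sigma_k)$, where the covariance $\Sigma_k$ has the representation

$$\Sigma_k = I - \delta_k \delta_k^T/P, \quad \text{where } \delta_k = \sqrt{\hat\nu_k}\, \beta_{J_k}.$$

Here $\hat\nu_k = \hat s_k \nu$.

Define $\sigma_k^2 = \hat s_{k-1}/\hat s_k$. The $X_{d_k}$ term appearing in (29) is given by

$$X_{d_k}^2 = \frac{\|G_k\|^2}{\sigma_k^2}.$$

Also, the distribution of $X_{d_k}^2$ given $\mathcal{F}_{k-1}$ is chi-square with $d_k = n - k + 1$ degrees of freedom, and further, it is independent of $Z_{k,J_k}$.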

The proof of the above lemma is considerably more involved. It is given in Appendix A. From the above lemma one gets that $\mathcal{Z}_{k,j}$ is the sum of two terms, the 'shift' term and the 'noise' term $Z_{k,j}$. The lemma also provides that the noise term is normal with a certain covariance matrix $\Sigma_k$.

Notice that Lemma 4 applies to the case $k = 1$ as well, with $\mathcal{F}_0$ defined as empty, since $\hat w_1 = \hat s_1 = 1$. The definition of $\Sigma_1$ using the above lemma is the same as that given in Lemma 3. Also note that the conditional distribution of $(\mathcal{Z}_{k,j} : j \in J_k)$, as given in Lemma 4, depends on $\mathcal{F}_{k-1}$ only through the $\|G_1\|, \dots, \|G_{k-1}\|$ and $(\mathcal{Z}_{k',j} : j \in J_{k'})$ for $k' < k$.

In the next subsection, we demonstrate that $Z_{k,j}$, for $j \in J_k$, are very close to being independent and identically distributed (i.i.d.).



C. The nearby distribution 

Recall that since the algorithm operates only on terms not detected previously, for the $k$th step we are only interested in terms in $J_k$. The previous two lemmas specified conditional distributions of $\mathcal{Z}_{k,j}$, for $j \in J_k$. However, for analysis purposes we find it helpful to assign distributions to the $\mathcal{Z}_{k,j}$, for $j \in J - J_k$ as well. In particular, conditional on $\mathcal{F}_{k-1}$, write

$$\mathcal{Z}_{k,j} = \sqrt{\hat w_k\, C_{j,R}}\; (X_{d_k}/\sqrt{n})\, 1_{\{j \in sent\}} + Z_{k,j} \quad \text{for } j \in J.$$

We fill out the specification of the distribution assigned by $P$, via a sequence of conditionals $P_{Z_k | \mathcal{F}_{k-1}}$ for $Z_k = (Z_{k,j} : j \in J)$, which is for all $j$ in $J$, not just for $j$ in $J_k$. For the variables $Z_{k,J_k}$ that we actually use, the conditional distribution is that of $P_{Z_{k,J_k} | \mathcal{F}_{k-1}}$ as specified in Lemmas 3 and 4. Whereas for the $Z_{k,j}$ with $j \in J - J_k$, given $\mathcal{F}_{k-1}$, we conveniently arrange them to be independent standard normal. This definition is contrary to the true conditional distribution of $Z_{k,j}$ for $j \in J - J_k$, given $\mathcal{F}_{k-1}$. However, it is a simple extension of the conditional distribution that shares the same marginalization to the true distribution of $(Z_{k,j} : j \in J_k)$ given $\mathcal{F}_{k-1}$.

Further, a simpler approximating distribution $Q$ is defined. Define $Q_{Z_k | \mathcal{F}_{k-1}}$ to be independent standard normal. Also, like $P$, the measure $Q$ makes the $X_{d_k}^2$ appearing in $\mathcal{Z}_{k,j}$ Chi-square($n - k + 1$) random variables independent of $Z_k$, conditional on $\mathcal{F}_{k-1}$.

In the following lemma we appeal to a sense of closeness of the distribution $P$ to $Q$, such that events exponentially unlikely under $Q$ remain exponentially unlikely under the governing measure $P$.



Lemma 5. For any event $A$ that is determined by the random variables,

$$\|G_1\|, \dots, \|G_k\| \text{ and } (\mathcal{Z}_{k',j} : j \in J_{k'}), \text{ for } k' \le k, \qquad (31)$$

one has

$$P[A] \le Q[A]\, e^{k c_0},$$

where $c_0 = (1/2) \log(1 + P/\sigma^2)$.

For ease of exposition we give the proof in Appendix B. Notice that the set $A$ is $\mathcal{F}_k$ measurable, since the random variables that $A$ depends on are $\mathcal{F}_k$ measurable.

D. Separation analysis 

Our analysis demonstrates that we can give good lower bounds for $\hat q_k$, the weighted proportion of correct detections in each step, and good upper bounds on $\hat f_k$, which is the proportion of false alarms in each step.

Denote the exception events

$$A_k = \{\hat q_k < q_k\} \quad \text{and} \quad B_k = \{\hat f_k > f_k\}.$$

Here the $q_k$ and $f_k$ are deterministic bounds for the proportion of correct detections and false alarms respectively, for each $k$. These will be specified in the subsequent subsection.

Assuming that we have got good controls on these quantities up to step $k - 1$, we now describe our characterization of $\mathcal{Z}^{comb}_{k,j}$, for $j \in J_k$, used in detection for the $k$th step. Define the exception sets

$$A_{1,k-1} = \cup_{k'=1}^{k-1} A_{k'} \quad \text{and} \quad B_{1,k-1} = \cup_{k'=1}^{k-1} B_{k'}.$$

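The manner in which the quantities $\hat q_1, \dots, \hat q_k$ and $\hat f_1, \dots, \hat f_k$ arise in the distributional analysis of Lemma 4 is through the sum

$$\hat q^{adj,tot}_k = \hat q^{adj}_1 + \dots + \hat q^{adj}_k$$

of the adjusted values $\hat q^{adj}_k = \hat q_k/(1 + \hat f_k/\hat q_k)$. Outside of $A_{1,k-1} \cup B_{1,k-1}$, one has

$$\hat q^{adj}_{k'} \ge q^{adj}_{k'} \quad \text{for } k' = 1, \dots, k-1, \qquad (32)$$

where, for each $k$,

$$q^{adj}_k = q_k/(1 + f_k/q_k).$$

Recall from Lemma 4 that,

$$\hat s_k = \frac{1}{1 - \hat q^{adj,tot}_{k-1}\, \nu}.$$

From relation (32), one has $\hat w_{k'} \ge w_{k'}$, for $k' = 1, \dots, k$, where $w_1 = 1$, and for $k > 1$,

$$w_k = \frac{1}{1 - q^{adj,tot}_{k-1}\, \nu} - \frac{1}{1 - q^{adj,tot}_{k-2}\, \nu}.$$

Here, for each $k$, we take $q^{adj,tot}_k = q^{adj}_1 + \dots + q^{adj}_k$.

Using this $w_k$ we define the corresponding vector of weights $(\lambda_{k',k} : k' = 1, \dots, k)$, used in forming the statistics $\mathcal{Z}^{comb}_{k,j}$, as

$$\lambda_{k',k} = \sqrt{\frac{w_{k'}}{w_1 + w_2 + \dots + w_k}}.$$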

Given that the algorithm has run for $k - 1$ steps, we now proceed to describe how we characterize the distribution of $\mathcal{Z}^{comb}_{k,j}$ for the $k$th step. Define the additional exception event

$$D_{1,k-1} = \cup_{k'=1}^{k-1} D_{k'}, \quad \text{with } D_k = \{X_{d_k}^2/n \le 1 - h\},$$

where $0 < h < 1$. Here the term $X_{d_k}^2$ is as given in Lemma 4. It follows a Chi-square distribution with $d_k = n - k + 1$ degrees of freedom. Define

$$E_{k-1} = A_{1,k-1} \cup B_{1,k-1} \cup D_{1,k-1}.$$

Notice that we have for $j \in$ sent that

$$\mathcal{Z}_{k',j} = \sqrt{\hat w_{k'}\, C_{j,R}}\; (X_{d_{k'}}/\sqrt{n}) + Z_{k',j}$$

and for $j \in$ other, we have

$$\mathcal{Z}_{k',j} = Z_{k',j},$$

for $k' = 1, \dots, k$. Further, denote $C_{j,R,h} = C_{j,R}(1 - h)$. Then on the set $E^c_{k-1} \cap D^c_k$, we have for $k' = 1, \dots, k$ that

$$\mathcal{Z}_{k',j} \ge \sqrt{w_{k'}}\, \sqrt{C_{j,R,h}} + Z_{k',j} \quad \text{for } j \in sent.$$
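Recall that,

$$\mathcal{Z}^{comb}_{k,j} = \lambda_1 \mathcal{Z}_{1,j} + \lambda_2 \mathcal{Z}_{2,j} + \dots + \lambda_k \mathcal{Z}_{k,j},$$

where for convenience we denote $\lambda_{k',k}$ as simply $\lambda_{k'}$. Define for each $k$ and $j \in J$, the combination of the noise terms by

$$Z^{comb}_{k,j} = \lambda_1 Z_{1,j} + \lambda_2 Z_{2,j} + \dots + \lambda_k Z_{k,j}.$$

From the above one sees that, for $j \in$ other the $\mathcal{Z}^{comb}_{k,j}$ equals $Z^{comb}_{k,j}$, and for $j \in$ sent, on the set $E^c_{k-1} \cap D^c_k$, the statistic $\mathcal{Z}^{comb}_{k,j}$ exceeds

$$\big[\lambda_1 \sqrt{w_1} + \dots + \lambda_k \sqrt{w_k}\big] \sqrt{C_{j,R,h}} + Z^{comb}_{k,j},$$

which is equal to

$$\sqrt{\frac{C_{j,R,h}}{1 - q^{adj,tot}_{k-1}\, \nu}} + Z^{comb}_{k,j}.$$

Summarizing,

$$\mathcal{Z}^{comb}_{k,j} = Z^{comb}_{k,j} \quad \text{for } j \in other$$

and, on the set $E^c_{k-1} \cap D^c_k$,

$$\mathcal{Z}^{comb}_{k,j} \ge shift_{k,j} + Z^{comb}_{k,j} \quad \text{for } j \in sent,$$

where

$$shift_{k,j} = \sqrt{\frac{C_{j,R,h}}{1 - x_{k-1}\, \nu}},$$

with $x_0 = 0$ and $x_{k-1} = q^{adj,tot}_{k-1}$, for $k \ge 2$. Since the $x_k$'s are increasing, the $shift_{k,j}$'s increase with $k$. It is this increase in the mean shifts that helps in additional detections.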
For each $j \in J$, set $H_{k,j}$ to be the event,

$$H_{k,j} = \big\{ shift_{k,j}\, 1_{\{j \in sent\}} + Z^{comb}_{k,j} \ge \tau \big\}. \qquad (33)$$

Notice that

$$H_{k,j} = \{\mathcal{Z}^{comb}_{k,j} \ge \tau\} \quad \text{for } j \in other. \qquad (34)$$

On the set $E^c_{k-1} \cap D^c_k$, defined above, one has

$$H_{k,j} \subseteq \{\mathcal{Z}^{comb}_{k,j} \ge \tau\} \quad \text{for } j \in sent. \qquad (35)$$

Using the above characterization of $\mathcal{Z}^{comb}_{k,j}$ we specify in the next subsection the values for $q_{1,k}$, $f_k$ and $q_k$. Recall that the quantity $q_{1,k}$, which was defined in subsection II-B, gave controls on $size_{1,k}$, the size of the decoded set $dec_{1,k}$ after the $k$th step.

E. Specification of $f_k$, $q_{1,k}$, and $q_k$, for $k = 1, \dots, m$

Recall from subsection IV-C that under the $Q$ measure the $Z_{k,j}$, for $j \in J$, are i.i.d. standard normal random variables. Define the random variable

$$\hat f^{up}_k = \sum_{j \in other} \pi_j 1_{H_{k,j}}. \qquad (36)$$

Notice that $\hat f_k \le \hat f^{up}_k$ since

$$\hat f_k = \sum_{j \in dec_k \cap other} \pi_j \le \sum_{j \in other} \pi_j 1_{\{\mathcal{Z}^{comb}_{k,j} \ge \tau\}}. \qquad (37)$$

The above inequality follows since $dec_k$ is a subset of $thresh_k = \{j : \mathcal{Z}^{comb}_{k,j} \ge \tau\}$ by construction. Further, the right side of (37) is equal to $\hat f^{up}_k$ using (34).



The expectation of $\hat f^{up}_k$ under the $Q$-measure is given by,

$$E_Q(\hat f^{up}_k) = (M-1)\, \bar\Phi(\tau),$$

where $\bar\Phi(\tau)$ is the upper tail probability of a standard normal at $\tau$. Here we use the fact that the $H_{k,j}$, for $j \in$ other, are i.i.d. Bernoulli $\bar\Phi(\tau)$ under the $Q$-measure and that $\sum_{j \in other} \pi_j$ is equal to $(M - 1)$.

Define $f^* = (M-1)\, \bar\Phi(\tau)$, which is the expectation of $\hat f^{up}_k$ from above. One sees that

$$f^* \le \frac{\exp\{-a\sqrt{2 \log M} - (1/2) a^2\}}{(\sqrt{2 \log M} + a)\sqrt{2\pi}}$$

using the form for the threshold $\tau$ given earlier. We also use that $\bar\Phi(x) \le \phi(x)/x$ for positive $x$, with $\phi$ being the standard normal density. We take $f_k = f$ to be a value greater than $f^*$. We express it in the form

$$f = \rho f^*,$$

with a constant factor $\rho > 1$. This completes the specification of the $f_k$.
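The bound on $f^*$ is easy to evaluate. The Python sketch below assumes the threshold has the form $\tau = \sqrt{2 \log M} + a$ with $a > 0$, which is the form consistent with the displayed bound; the values of `M` and `a` in the loop are arbitrary illustrative choices.

```python
import math

def upper_tail(x):
    """Upper tail probability of a standard normal at x."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def f_star(M, a):
    """f* = (M - 1) * upper_tail(tau), assuming tau = sqrt(2 log M) + a."""
    tau = math.sqrt(2 * math.log(M)) + a
    return (M - 1) * upper_tail(tau)

def f_star_bound(M, a):
    """exp(-a sqrt(2 log M) - a^2/2) / ((sqrt(2 log M) + a) sqrt(2 pi))."""
    root = math.sqrt(2 * math.log(M))
    return math.exp(-a * root - 0.5 * a * a) / ((root + a) * math.sqrt(2 * math.pi))

for M in (2**8, 2**16):
    for a in (0.5, 1.0):
        print(M, a, f_star(M, a), f_star_bound(M, a))
```

The bound follows from $(M-1)\bar\Phi(\tau) \le M \phi(\tau)/\tau$ together with $\tau^2/2 = \log M + a\sqrt{2\log M} + a^2/2$, so the factor $M$ cancels exactly.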

Next, we specify the $q_{1,k}$ used in pacing the steps. Denote the random variable,

$$\hat q_{1,k} = \sum_{j\ sent} \pi_j 1_{H_{k,j}}. \qquad (39)$$

Likewise, define $q^*_{1,k}$ as the expectation of $\hat q_{1,k}$ under the $Q$ measure. Using (33), one has

$$q^*_{1,k} = \sum_{j\ sent} \pi_j\, \bar\Phi(-\mu_{k,j}),$$

where $\mu_{k,j} = shift_{k,j} - \tau$. Like before, we take $q_{1,k}$ to be a value less than $q^*_{1,k}$. More specifically, we take

$$q_{1,k} = q^*_{1,k} - \eta \qquad (40)$$

for a positive $\eta$.

This specification of $q^*_{1,k}$, and the related $q_{1,k}$, is a recursive definition. In particular, denote

$$\mu_j(x) = \sqrt{\frac{C_{j,R,h}}{1 - x\nu}} - \tau.$$

Then $q^*_{1,k}$ equals the function

$$g_L(x) = \sum_{j\ sent} \pi_j\, \bar\Phi(-\mu_j(x)) \qquad (41)$$

evaluated at $x_{k-1} = q^{adj,tot}_{k-1}$, with $x_0 = 0$.

For instance, in the constant power allocation case $C_{j,R,h} = (R_0(1-h)/R)\, 2 \log M$ is the same for all $j$. This makes $shift_{k,j}$ the same for each $j$. Consequently, $\mu_j(x) = \mu(x)$, where $\mu(x) = \sqrt{1/(1 - x\nu)}\, \sqrt{C_{j,R,h}} - \tau$. Then one has $q^*_{1,k} = \Phi(\mu(x_{k-1}))$. It obeys the recursion $q^*_{1,k} = g_L(x)$ evaluated at $x_{k-1} = q^{adj,tot}_{k-1}$, with $g_L(x) = \Phi(\mu(x))$.
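For the constant power allocation case, the update function is a one-line formula, sketched below in Python. Several ingredients are assumptions of this sketch rather than statements from the text: the threshold is taken as $\tau = \sqrt{2 \log M} + a$, rates are in nats, $R_0 = \nu/2$ is inferred from $C_{j,R} = n \pi_j \nu$ with $\pi_j = 1/L$ and $nR = L \log M$, and all numerical values are illustrative.

```python
import math

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def g_const(x, snr, R, M, a, h):
    """g_L(x) = Phi(mu(x)) for constant power allocation (rates in nats)."""
    nu = snr / (1 + snr)
    R0 = nu / 2                                   # inferred from C_{j,R} = n pi_j nu
    C_h = (R0 * (1 - h) / R) * 2 * math.log(M)    # C_{j,R,h}, the same for all j
    tau = math.sqrt(2 * math.log(M)) + a          # assumed form of the threshold
    mu = math.sqrt(C_h / (1 - x * nu)) - tau
    return Phi(mu)

snr, R, M, a, h = 7.0, 0.4, 2**16, 0.5, 0.05      # illustrative values only
xs = [i / 10 for i in range(10)]
vals = [g_const(x, snr, R, M, a, h) for x in xs]
for x, v in zip(xs, vals):
    print(f"x = {x:.1f}  g_L(x) = {v:.4f}  g_L(x) - x = {v - x:+.4f}")
```

Printing $g_L(x) - x$ alongside $g_L(x)$ shows where the function lies above the diagonal, which is the property that drives the progression of the $q_{1,k}$.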

Further, we define the target detection rate for the $k$th step, given by $q_k$, as

$$q_k = q_{1,k} - q_{1,k-1} - 1/L_\pi - f, \qquad (42)$$

with $q_{1,0}$ taken to be zero. Thus the $q_k$ are specified from the $q_{1,k}$ and $f$. Also, $1/L_\pi = \min_{\ell = 1, \dots, L} \pi_{(\ell)}$ is a quantity of order $1/L$. For the power allocation (3), one sees that $L_\pi = L\nu/(2C)$.



F. Building Up the Total Detection Rate 

The previous section demonstrated the importance of the function $g_L(x)$, given by (41). This function is defined on $[0, 1]$ and takes values in the interval $(0, 1)$. Recall from subsection II-B on pacing the steps, that the quantities $q_{1,k}$ are closely related to the proportion of sections correctly detected after $k$ steps, if we ignore false alarm effects. Consequently, to ensure sufficient correct detections one would like the $q_{1,k}$ to increase with $k$ to a value near 1. Through the recursive definition of $q_{1,k}$, this amounts to ensuring that the function $g_L(x)$ is greater than $x$ for an interval $[0, x_r]$, with $x_r$ preferably near 1.

Definition: A function $g(x)$ is said to be accumulative for $0 \le x \le x_r$ with a positive gap, if

$$g(x) - x \ge gap$$

for all $0 \le x \le x_r$. Moreover, an adaptive successive decoder is accumulative with a given rate and power allocation if the corresponding function $g_L(x)$ satisfies this property for given $x_r$ and positive gap.

To detail the progression of the $q_{1,k}$ consider the following lemma.

Lemma 6. Assume $g_L(x)$ is accumulative on $[0, x_r]$ with a positive gap, and $\eta$ is chosen so that $gap - \eta$ is positive. Further, assume

$$f \le (gap - \eta)^2/8 - 1/(2L_\pi). \qquad (43)$$

Then, one can arrange for an $m$ so that the $q_{1,k}$, for $k = 1, \dots, m$, defined by (40), are increasing and

$$q_{1,m} \ge x_r + gap - \eta.$$

Moreover, the number of steps $m$ is at most $2/(gap - \eta)$.

The proof of Lemma 6 is given in Appendix C. We now
proceed to describe how we demonstrate the reliability of the
algorithm using the quantities chosen above.

V. Reliability of the Decoder 

We are interested in demonstrating that the probability of the event $A_{1,m} \cup B_{1,m}$ is small. This ensures that for each step $k$, where $k$ ranges from 1 to $m$, the proportion of correct detections $\hat q_k$ is at least $q_k$, and the proportion of false alarms $\hat f_k$ is at most $f_k = f$. We do this by demonstrating that the probability of the set

$$E_m = A_{1,m} \cup B_{1,m} \cup D_{1,m}$$

is exponentially small. The following lemma will be useful in this regard. Recall that

$$A_{1,m} = \cup_{k=1}^{m} \{\hat q_k < q_k\} \quad \text{and} \quad B_{1,m} = \cup_{k=1}^{m} \{\hat f_k > f\}.$$

Lemma 7. Let $q_{1,k}$, $q_k$ and $f$ be as defined in subsection IV-E. Denote

$$\tilde A_{1,m} = \cup_{k=1}^{m} \{\hat q_{1,k} < q_{1,k}\} \quad \text{and} \quad \tilde B_{1,m} = \cup_{k=1}^{m} \{\hat f^{up}_k > f\}.$$

Then,

$$E_m \subseteq \tilde A_{1,m} \cup \tilde B_{1,m} \cup D_{1,m}.$$

For ease of exposition we provide the proof of this lemma in Appendix D. The lemma above describes how we control the probability of the exception set $E_m$.

We demonstrate that the probability of $E_m$ is exponentially small by showing that the probability of $\tilde A_{1,m} \cup \tilde B_{1,m} \cup D_{1,m}$, which contains $E_m$, is exponentially small. Also, notice that outside the set $E_m$, the weighted fraction of failed detections and false alarms, denoted by $\hat\delta_{wght}$ in (21), is bounded by

$$\Big(1 - \sum_{k=1}^{m} q_k\Big) + m f,$$

which, after recalling the definition of $q_k$ in (42), can also be expressed as,

$$1 - q_{1,m} + 2 m f + m/L_\pi. \qquad (44)$$

Now, assume that $g_L$ is accumulative on $[0, x_r]$ with a positive gap. Then, from Lemma 6, for $\eta < gap$, and $f > f^*$ satisfying (43), one has that (44) is upper bounded by

$$\delta_{wght} = (1 - x_r) - (gap - \eta)/2, \qquad (45)$$

using the bounds on $f$, $q_{1,m}$ and $m$ given in the lemma. Consequently, $\hat\delta_{mis}$, the mistake rate after $m$ steps, given by (2), is bounded by $\delta_{mis}$ outside of $\tilde A_{1,m} \cup \tilde B_{1,m} \cup D_{1,m}$, where,

$$\delta_{mis} = \frac{snr}{2C} \big[(1 - x_r) - (gap - \eta)/2\big], \qquad (46)$$

via (22). We then have the following theorem regarding the reliability of the algorithm.

Theorem 8. Let the conditions of Lemma 6 hold, and let $\delta_{mis}$ be as in (46). Then,

$$P(\hat\delta_{mis} > \delta_{mis}) \le m e^{-2 L_\pi \eta^2 + m c_0} + m e^{-L_\pi f D(\rho)/\rho} + m e^{-(n-m+1)h^2/2 + m h}.$$

Here the quantities $\eta$ and $\rho$ are as defined in subsection IV-E, and $c_0$ is as given in Lemma 5. Also $D(\rho) = \rho \log \rho - (\rho - 1)$.
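To get a feel for how the terms of such a bound behave, one can evaluate them directly. The Python sketch below sums the three contributions established in the proof (for the events on correct detections, false alarms and the Chi-square terms); the function name `error_bound` and every numerical input are hypothetical and chosen only to illustrate the scaling.

```python
import math

def D(rho):
    """D(rho) = rho log rho - (rho - 1)."""
    return rho * math.log(rho) - (rho - 1)

def error_bound(n, m, L_pi, eta, f, rho, h, c0):
    """Sum of the three union-bound terms from the proof of the theorem."""
    term_A = m * math.exp(-2 * L_pi * eta**2 + m * c0)       # correct detections
    term_B = m * math.exp(-L_pi * f * D(rho) / rho)          # false alarms
    term_D = m * math.exp(-(n - m + 1) * h**2 / 2 + m * h)   # Chi-square events
    return term_A + term_B + term_D

# Hypothetical parameter values; only the scaling with n and L_pi matters here.
b_small = error_bound(n=1_000, m=5, L_pi=100, eta=0.1, f=0.01, rho=2.0, h=0.1, c0=0.5)
b_large = error_bound(n=100_000, m=5, L_pi=10_000, eta=0.1, f=0.01, rho=2.0, h=0.1, c0=0.5)
print(b_small, b_large)
```

For small $n$ and $L_\pi$ the expression can exceed 1 and is then uninformative; it becomes exponentially small once $L_\pi \eta^2$, $L_\pi f$ and $n h^2$ dominate the $m c_0$ and $m h$ terms.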

Proof of Theorem 8: From Lemma 7 and the arguments above, the event $\{\hat\delta_{mis} > \delta_{mis}\}$ is contained in the event

$$\tilde A_{1,m} \cup \tilde B_{1,m} \cup D_{1,m}.$$

Consequently, we need to control the probability of the above three events under the $P$ measure.

We first control the probability of the event $D_{1,m}$, which is the union of Chi-square events $D_k = \{X_{d_k}^2/n \le 1 - h\}$. Now the event $D_k$ can be expressed as $\{X_{d_k}^2/d_k \le 1 - h_k\}$, where $h_k = (nh - k + 1)/(n - k + 1)$. Using a standard Chernoff bound argument, one gets that

$$P(D_k) \le e^{-(n-k+1) h_k^2/2}.$$

The exponent in the above is at least $(n - k + 1)h^2/2 - kh$. Consequently, as $k \le m$, one gets, using a union bound, that

$$P(D_{1,m}) \le m e^{-(n-m+1)h^2/2 + m h}.$$

Next, let us focus on the event $\tilde B_{1,m}$, which is the union of events $\{\hat f^{up}_k > f\}$. Divide $\hat f^{up}_k$, $f$ by $M - 1$ to get $\hat p_k$, $p$ respectively. Consequently, $\tilde B_{1,m}$ is also the union of the events $\{\hat p_k > p\}$, for $k = 1, \dots, m$, where

$$\hat p_k = \frac{1}{M-1} \sum_{j \in other} \pi_j 1_{H_{k,j}}$$

and $p = f/(M - 1)$, with $f = \rho f^*$.

Recall, as previously discussed, for $j$ in other, the events $H_{k,j}$ are i.i.d. Bernoulli($p^*$) under the measure $P$, where $p^* = f^*/(M - 1)$. Consequently, from Lemma 13 in Appendix E, the probability of the event $\{\hat p_k \ge p\}$ is less than $e^{-L_\pi (M-1) D(p\|p^*)}$. Therefore,

$$P(\tilde B_{1,m}) \le m e^{-L_\pi (M-1) D(p\|p^*)}.$$

To handle the exponents $(M-1) D(p\|p^*)$ at the small values $p$ and $p^*$, we use the Poisson lower bound on the Bernoulli relative entropy, as shown in Appendix F. This produces the lower bound $(M-1)[p \log p/p^* + p^* - p]$, which is equal to

$$f \log f/f^* + f^* - f.$$

We may write this as $f^* D(\rho)$, or equivalently $f D(\rho)/\rho$, where the functions $D(\rho)$ and $D(\rho)/\rho = \log \rho + 1 - 1/\rho$ are increasing in $\rho$.
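The Poisson lower bound and the rewriting in terms of $D(\rho)$ can be confirmed numerically. In the Python sketch below the probability pairs and the values of `M`, `f_star` and `rho` are arbitrary illustrative choices; only the two identities being checked come from the text.

```python
import math

def bernoulli_kl(p, p_star):
    """Relative entropy D(p || p*) between Bernoulli(p) and Bernoulli(p*)."""
    return p * math.log(p / p_star) + (1 - p) * math.log((1 - p) / (1 - p_star))

def poisson_lower(p, p_star):
    """Poisson-type lower bound p log(p/p*) + p* - p."""
    return p * math.log(p / p_star) + p_star - p

def D(rho):
    """D(rho) = rho log rho - (rho - 1)."""
    return rho * math.log(rho) - (rho - 1)

for p, p_star in [(2e-6, 1e-6), (0.01, 0.001), (0.3, 0.1)]:
    print(p, p_star, bernoulli_kl(p, p_star), poisson_lower(p, p_star))

# Rewriting: (M - 1) [p log(p/p*) + p* - p] = f* D(rho) when f = rho f*.
M, f_star, rho = 2**16, 1e-3, 2.0
f = rho * f_star
lhs = (M - 1) * poisson_lower(f / (M - 1), f_star / (M - 1))
rhs = f_star * D(rho)
print(lhs, rhs)
```

The lower bound holds because $(1-p)\log\frac{1-p}{1-p^*} \ge p^* - p$, which follows from $\log x \ge 1 - 1/x$.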

Lastly, we control the probability of the event $A_{1,m}$, which is the union of the events $\{\hat q_{1,k} < q_{1,k}\}$, where

$$\hat q_{1,k} = \sum_{j \in sent} \pi_j 1_{H_{k,j}}.$$

We first bound the probability under the $\mathbb{Q}$ measure. Recall that under $\mathbb{Q}$, the $H_{k,j}$, for $j \in sent$, are independent Bernoulli, with the expectation of $\hat q_{1,k}$ being $q^*_{1,k}$. Consequently, using Lemma 13 in Appendix E, we have

$$\mathbb{Q}(\hat q_{1,k} < q_{1,k}) \le e^{-L_\pi D(q_{1,k}\|q^*_{1,k})}.$$

Further, by the Pinsker-Csiszar-Kullback-Kemperman inequality, specialized to Bernoulli distributions, the expression $D(q_{1,k}\|q^*_{1,k})$ in the above exceeds $2(q_{1,k} - q^*_{1,k})^2$, which is $2\eta^2$, since $q^*_{1,k} - q_{1,k} = \eta$.

Correspondingly, one has

$$\mathbb{Q}(A_{1,m}) \le m\, e^{-2 L_\pi \eta^2}.$$

Now, use the fact that the event $A_{1,m}$ is $\mathcal{F}_m$ measurable, along with Lemma 5, to get that

$$\mathbb{P}(A_{1,m}) \le m\, e^{-2 L_\pi \eta^2 + m c_0}.$$

This completes the proof of the lemma. ■
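The Bernoulli form of the Pinsker-type inequality used in this step, $D(q\|q^*) \ge 2(q - q^*)^2$, is easy to confirm on a grid. This is a minimal sketch with hypothetical helper names; the grid resolution is an arbitrary choice.

```python
import math

def bernoulli_kl(q, q_star):
    """Relative entropy D(q || q_star) between Bernoulli laws, in nats."""
    total = 0.0
    for a, b in ((q, q_star), (1.0 - q, 1.0 - q_star)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def pinsker_slack(q, q_star):
    """D(q || q_star) - 2 (q - q_star)^2, which should be non-negative."""
    return bernoulli_kl(q, q_star) - 2.0 * (q - q_star) ** 2

# Smallest slack over a grid of (q, q_star) pairs strictly inside (0, 1).
grid = [i / 50.0 for i in range(1, 50)]
worst = min(pinsker_slack(q, qs) for q in grid for qs in grid)
print(worst)
```

With $q^* - q = \eta$ this gives the exponent $2 L_\pi \eta^2$ appearing in the bound on $\mathbb{Q}(A_{1,m})$.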

VI. Computational Illustrations 

We illustrate in two ways the performance of our algorithm. First, for fixed values $L$, $B$, snr and rates below capacity we evaluate detection rate as well as probability of exception set $p_e$ using the theoretical bounds given in Theorem 8. Plots demonstrating the progression of our algorithm are also shown. These highlight the crucial role of the function $g_L$ in achieving high reliability.

Figure 3 presents the results of computation using the reliability bounds of Theorem 8 for fixed $L$ and $B$ and various choices of snr and rates below capacity. The dots in these figures denote $q_{1,k}$, for each step $k$.

[Figure 3: plots of $g_L(x)$. Legend recovered from one panel: C = 0.5 bits, R = 0.3 bits (0.59C), No. of steps = 7.]

Fig. 3. Plots demonstrating progression of our algorithm. (Plot on left) snr = 15. The weighted (unweighted) detection rate is 0.995 (0.985) for a failed detection rate of 0.014 and the false alarm rate is 0.005. (Plot on right) snr = 1. The detection rate (both weighted and unweighted) is 0.944 and the false alarm and failed detection rates are 0.016 and 0.055 respectively.

For illustrative purposes we take $B = 2^{16}$, $L = B$ and snr values of 1, 7 and 15. The probability of error $p_e$ is set to be near $10^{-3}$. For each snr value the maximum rate, over a grid of values, for which the error probability is less than $p_e$ is determined. With snr = 1 (Fig. 3), this rate $R$ is 0.3 bits which is 59% of capacity. When snr is 7 (Fig. 2) and 15 (Fig. 3), these rates correspond to 49% and 42% of their corresponding capacities.
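For orientation, the capacities behind these percentages follow from $C = \frac{1}{2}\log_2(1 + snr)$ bits per channel use. The short sketch below (function name assumed for illustration) converts the quoted fractions of capacity into rates in bits.

```python
import math

def capacity_bits(snr):
    """Capacity of the AWGN channel in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)

# Fractions of capacity quoted in the text for snr = 1, 7 and 15.
for snr, fraction in [(1, 0.59), (7, 0.49), (15, 0.42)]:
    c = capacity_bits(snr)
    print(f"snr={snr}: C={c:.2f} bits, rate at {fraction:.0%} of C is {fraction * c:.3f} bits")
```

For snr = 1 the capacity is 0.5 bits, so a rate near 0.3 bits is roughly 59% of capacity, in line with the legend of Fig. 3.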

For the above computations we chose power allocations of the form

$$P_{(\ell)} \propto \max\{e^{-2\gamma \ell/L}, u\},$$

with $0 \le \gamma \le C$, and $u > 0$. Here the choices of $a$, $u$ and $\gamma$ are made, by computational search, to minimize the resulting sum of false alarms and failed detections, as per our bounds. In the snr = 1 case the optimum $\gamma$ is 0, so we have constant power allocation in this case. In the other two cases, there is variable power across most of the sections. The role of a positive $u$ is to increase the relative power allocation for sections with low weights.
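A minimal sketch of this family of allocations follows. It assumes the weights are normalized so that the section powers sum to the total power $P$; the function name and the sample parameter values are illustrative, not taken from the paper.

```python
import math

def power_allocation(L, P, gamma, u):
    """Section powers proportional to max(exp(-2*gamma*l/L), u), summing to P."""
    weights = [max(math.exp(-2.0 * gamma * ell / L), u) for ell in range(1, L + 1)]
    total = sum(weights)
    return [P * w / total for w in weights]

# gamma = 0 gives constant power; gamma > 0 with u > 0 levels off the tail.
flat = power_allocation(L=8, P=1.0, gamma=0.0, u=0.5)
leveled = power_allocation(L=8, P=1.0, gamma=1.0, u=0.5)
print(flat)
print(leveled)
```

In the leveled case the last sections share a common power floor, which raises their allocation relative to a pure exponential decay.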

Figure 4 gives plots of achievable rates as a function of $B$. For each $B$, the points on the detailed envelope correspond to the numerically evaluated maximum inner code rate for which the section error is between 9 and 10%. Here we assume $L$ to be large, so that the $\hat q_{1,k}$ and $\hat f_k$ are replaced by the expected values $q^*_{1,k}$ and $f^*$, respectively. We also take $h = 0$. This gives an idea about the best possible rates for a given snr and section error rate.

For the simulation curve, $L$ was fixed at 100 and for given snr, $B$, and rate values, $10^4$ runs of our algorithm were performed. The maximum rate over the grid of values satisfying section error rate of less than 10% except in 10 replicates (corresponding to an estimated $p_e$ of $10^{-3}$) is shown in the plots. Interestingly, even for such small values of $L$, the curve is quite close to the detailed envelope curve, showing that our theoretical bounds are quite conservative.

VII. Achievable Rates approaching Capacity 

We demonstrate analytically that rates $R$ moderately close to $C$ are attainable by showing that the function $g_L(x)$ providing the updates for the fraction of correctly detected terms is indeed accumulative for suitable $x_r$ and gap. Then the reliability of the decoder can be established via Theorem 8. In particular, the matter of normalization of the weights $\pi_{(\ell)}$ is developed in subsection VII-A. An integral approximation $g(x)$ to the sum $g_L(x)$ is provided in subsection VII-B, and in subsection VII-C we show that it is accumulative. Subsection VII-D addresses the issue of control of parameters that arise in specifying the code. In subsection VII-E we give the proof of Proposition 1.



A. Variable power allocations 

As mentioned earlier, we consider power allocations $P_{(\ell)}$ proportional to $e^{-2C\ell/L}$. The function $g_L(x)$, defined earlier, may also be expressed as

$$g_L(x) = \sum_{\ell=1}^{L} \pi_{(\ell)}\, \Phi(\mu(x, u_\ell C'/R)),$$

where $\pi_{(\ell)} = P_{(\ell)}/P$, and,

$$\mu(x, u) = \left(\sqrt{u/(1 - x\nu)} - 1\right)\sqrt{2 \log M} - a$$

and

$$u_\ell = e^{-2C(\ell-1)/L} \quad \text{and} \quad C' = \mathcal{C}(1 - h),$$

with $\mathcal{C}$ as in (24). Here we use the fact that $C_{j,R}$, for the above power allocation, is given by $u_\ell\, \mathcal{C}/R$ if $j$ is in section $\ell$, as demonstrated in (25).

Further, notice that $\pi_{(\ell)} = u_\ell/sum$, with $sum = \sum_{\ell=1}^{L} u_\ell$. One sees that $sum = L\nu/(2\mathcal{C})$, with $\nu = P/(P + \sigma^2)$. Using this one gets that

$$g_L(x) = \frac{2\mathcal{C}}{\nu L} \sum_{\ell=1}^{L} u_\ell\, \Phi(\mu(x, u_\ell C'/R)). \tag{47}$$
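The update function in (47) can be evaluated directly from its definition. The sketch below is illustrative only: the function names are hypothetical, the ratio $C'/R$ is passed in as a single assumed parameter `c_over_r`, and the weights are normalized by the exact geometric sum $\sum_\ell u_\ell = \nu/(1 - e^{-2C/L})$ with $C$ in nats.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def section_weights(L, snr):
    """Unnormalized weights u_l = exp(-2 C (l - 1) / L), with C in nats."""
    C = 0.5 * math.log(1.0 + snr)
    return [math.exp(-2.0 * C * (ell - 1) / L) for ell in range(1, L + 1)]

def g_L(x, L, M, snr, a, c_over_r):
    """Weighted sum of Phi(mu(x, u_l * C'/R)) with weights u_l / sum(u)."""
    nu = snr / (1.0 + snr)
    scale = math.sqrt(2.0 * math.log(M))
    u = section_weights(L, snr)
    total = sum(u)
    value = 0.0
    for u_l in u:
        mu = (math.sqrt(u_l * c_over_r / (1.0 - x * nu)) - 1.0) * scale - a
        value += (u_l / total) * Phi(mu)
    return value

# Assumed sample parameters: L sections of size M, snr = 7, a = 1, C'/R = 1.5.
values = [g_L(x / 10.0, 50, 2**16, 7.0, 1.0, 1.5) for x in range(11)]
print(values)
```

Since $\mu(x, u)$ increases with $x$, the computed values increase along the grid, which is the monotonicity the accumulation argument relies on.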
Fig. 4. Plots of achievable rates as a function of $B$ for snr values of 15, 7 and 1. Section error rate is controlled to be between 9 and 10%. For the curve using simulation runs the error probability of making more than 10% section mistakes is taken to be $10^{-3}$.



B. Formulation and evaluation of the integral g(x) 

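Recognize that the sum in (47) corresponds closely to an integral. In each interval $(\ell-1)/L \le t < \ell/L$, for $\ell$ from 1 to $L$, we have $e^{-2C(\ell-1)/L}$ at least $e^{-2Ct}$. Consequently, $g_L(x)$ is greater than $g(x)$ where

$$g(x) = \frac{2C}{\nu} \int_0^1 e^{-2Ct}\, \Phi(\mu(x, e^{-2Ct} C'/R))\, dt. \tag{48}$$

The $g_L(x)$ and $g(x)$ are increasing functions of $x$ on $[0, 1]$.

Let's provide further characterization and evaluation of the integral $g(x)$. Let

$$z_x^{low} = \mu(x, (1 - \nu)C'/R) \quad \text{and} \quad z_x^{max} = \mu(x, C'/R).$$

Further, let $\delta_a = a/\sqrt{2 \log M}$. For emphasis we write out that $z_x = z_x^{low}$ takes the form

$$z_x = \left[\frac{\sqrt{(1-\nu)C'/R}}{\sqrt{1 - x\nu}} - (1 + \delta_a)\right]\sqrt{2 \log M}. \tag{49}$$

Change the variable of integration in (48) from $t$ to $u = e^{-2Ct}$. Observing that $e^{-2C} = 1 - \nu$, one sees that

$$g(x) = \frac{1}{\nu} \int_{1-\nu}^{1} \Phi(\mu(x, uC'/R))\, du.$$

Now since

$$\Phi(\mu) = \int 1_{\{z \le \mu\}}\, \phi(z)\, dz,$$

it follows that

$$g(x) = \frac{1}{\nu} \int\!\!\int 1_{\{1-\nu \le u \le 1\}}\, 1_{\{z \le \mu(x, uC'/R)\}}\, \phi(z)\, dz\, du. \tag{50}$$

In (50), the inequality

$$z \le \mu(x, uC'/R)$$

is the same as

$$\sqrt{u} \ge \sqrt{u_x R/C'}\, \left(1 + (z + a)/\sqrt{2 \log M}\right),$$

provided $z_x^{low} \le z \le z_x^{max}$. Here $u_x = 1 - x\nu$. Thereby, for all $z$, the length of this interval of values of $u$ can be written as

$$\left[1 - \max\left\{ u_x \frac{R}{C'} \left(1 + \frac{z + a}{\sqrt{2 \log M}}\right)_+^2,\; 1 - \nu \right\}\right]_+.$$

Thus $g(x)$ is equal to,

$$\frac{1}{\nu} \int \left[1 - \max\left\{ \frac{R}{C'} u_x \left(1 + \frac{z + a}{\sqrt{2 \log M}}\right)_+^2,\; 1 - \nu \right\}\right]_+ \phi(z)\, dz. \tag{51}$$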

Lemma 9. Derivative evaluation. The derivative $g'(x)$ may be expressed as

$$g'(x) = \frac{R}{C'} \int_{z_x^{low}}^{z_x^{max}} (1 + \delta_a + \delta_z)^2\, \phi(z)\, dz, \tag{52}$$

where $\delta_z = z/\sqrt{2 \log M}$. Further, if

$$R = \frac{C'}{(1 + \delta_a)^2 (1 + r/\log M)}, \tag{53}$$

with $r > r_0$, where

$$r_0 = \frac{1}{2(1 + \delta_a)^2}, \tag{54}$$

then the difference $g(x) - x$ is a decreasing function of $x$.

Proof: The integrand in (51) is continuous and piecewise differentiable in $x$, and its derivative is the integrand in (52). Further, (52) is less than,

$$\frac{R}{C'} \int_{-\infty}^{\infty} (1 + \delta_a + \delta_z)^2\, \phi(z)\, dz = \frac{R}{C'} \left[(1 + \delta_a)^2 + 1/(2 \log M)\right],$$

which is less than 1 for $r > r_0$. Consequently, $g(x) - x$ is decreasing as it has a negative derivative. ■

Corollary 10. A lower bound. The function $g(x)$ is at least

$$g_{low}(x) = \frac{1}{\nu} \int \left[1 - \max\left\{ \frac{R}{C'} u_x \left(1 + \frac{z + a}{\sqrt{2 \log M}}\right)_+^2,\; 1 - \nu \right\}\right] \phi(z)\, dz.$$

This $g_{low}(x)$ is equal to

$$x + \delta_R \frac{u_x}{\nu} [1 - \Phi(z_x)] + (1 - x)\Phi(z_x) - 2(1 + \delta_a) \frac{R}{C'} \frac{u_x}{\nu} \frac{\phi(z_x)}{\sqrt{2 \log M}} - \frac{R}{C'} \frac{u_x}{\nu} \frac{z_x \phi(z_x)}{2 \log M}, \tag{55}$$

where

$$\delta_R = \frac{r - r_0}{\log M + r}.$$

Moreover, this $g_{low}(x)$ has derivative $g'_{low}(x)$ given by

$$\frac{R}{C'} \int_{z_x}^{\infty} (1 + \delta_a + \delta_z)^2\, \phi(z)\, dz.$$

Proof: The integral expressions for $g_{low}(x)$ are the same as for $g(x)$ except that the upper end point of the integration extends beyond $z_x^{max}$, where the integrand is negative. The lower bound conclusion follows from this negativity of the integrand above $z_x^{max}$. The evaluation of $g_{low}(x)$ is fairly straightforward after using $z\phi(z) = -\phi'(z)$ and $z^2\phi(z) = \phi(z) - (z\phi(z))'$. Also use that $\Phi(z)$ tends to 1, while $\phi(z)$ and $z\phi(z)$ tend to 0 as $z \to \infty$. This completes the proof of Corollary 10. ■

Remark: What we gain with this lower bound is simplification because the result depends on $x$ only through $z_x = z_x^{low}$.

C. Showing g(x) is greater than x

The preceding subsection established that $g_L(x) - x$ is at least $g_{low}(x) - x$. We now show that

$$h_{low}(x) = g_{low}(x) - x$$

is at least a positive value, which we denote as gap, on an interval $[0, x_r]$, with $x_r$ suitably chosen.

Recall that $z_x = z_x^{low}$, given by (49), is a strictly increasing function of $x$, with values in the interval $I_0 = [z_0, z_1]$ for $0 \le x \le 1$. For values $z$ in $I_0$, let $x = x(z)$ be the choice for which $z_x = z$. With the rate $R$ of the form (53), let $x_r$ be the value of $x$ for which $z_x$ is 0. One finds that $x_r$ satisfies,

$$1 - x_r = \frac{r}{snr \log M}. \tag{56}$$

We now show that $h_{low}(x)$ is positive on $[0, x_r]$, for $r$ at least a certain value, which we call $r_1$.

Lemma 11. Positivity of $h_{low}(x)$ on $[0, x_r]$. Let rate $R$ be of the form (53), with $r > r_1$, where

$$r_1 = r_0/2 + \frac{\sqrt{\log M}}{\sqrt{\pi}\,(1 + \delta_a)}. \tag{57}$$

Then, for $0 \le x \le x_r$ the difference $h_{low}(x)$ is greater than or equal to

$$gap = \frac{r - r_1}{snr \log M}. \tag{58}$$

Proof of Lemma 11: The function $g(x)$ has lower bound $g_{low}(x)$. By Corollary 10, $g_{low}(x)$ has derivative bounded by

$$\frac{R}{C'} \int_{-\infty}^{\infty} (1 + \delta_a + \delta_z)^2\, \phi(z)\, dz = \frac{(1 + \delta_a)^2 + 1/(2 \log M)}{(1 + \delta_a)^2 (1 + r/\log M)} = \frac{1 + r_0/\log M}{1 + r/\log M},$$

which is less than 1 for $r > r_0$. Thus $g_{low}(x) - x$ is decreasing as it has a negative derivative.

To complete the proof, evaluate $g_{low}(x) - x$ at the point $x = x_r$. The point $x_r$ is the choice where $z_x = 0$. After using (55), it is seen that the value $g_{low}(x_r) - x_r$ is equal to gap, where gap is given by (58). ■
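The evaluation at $x = x_r$ can be reproduced numerically. The sketch below is an illustrative check under stated assumptions: it implements the closed form of $g_{low}$ as $x + \delta_R (u_x/\nu)[1 - \Phi(z_x)] + (1 - x)\Phi(z_x)$ minus the two $\phi(z_x)$ terms, takes $R/C'$ from the rate form $1/[(1+\delta_a)^2(1 + r/\log M)]$, and uses arbitrary sample values of $M$, snr, $a$ and $r$. All function names are hypothetical.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def constants(snr, M, a, r):
    """Shared quantities: log M, sqrt(2 log M), nu, delta_a, R/C', r_0."""
    log_m = math.log(M)
    scale = math.sqrt(2.0 * log_m)
    nu = snr / (1.0 + snr)
    delta_a = a / scale
    ratio = 1.0 / ((1.0 + delta_a) ** 2 * (1.0 + r / log_m))  # R / C'
    r0 = 0.5 / (1.0 + delta_a) ** 2
    return log_m, scale, nu, delta_a, ratio, r0

def g_low(x, snr, M, a, r):
    """Closed-form lower bound on g(x), depending on x through u_x and z_x."""
    log_m, scale, nu, delta_a, ratio, r0 = constants(snr, M, a, r)
    u_x = 1.0 - x * nu
    z_x = (math.sqrt((1.0 - nu) / (ratio * u_x)) - (1.0 + delta_a)) * scale
    delta_r = (r - r0) / (log_m + r)
    return (x + delta_r * (u_x / nu) * (1.0 - Phi(z_x)) + (1.0 - x) * Phi(z_x)
            - 2.0 * (1.0 + delta_a) * ratio * (u_x / nu) * phi(z_x) / scale
            - ratio * (u_x / nu) * z_x * phi(z_x) / (2.0 * log_m))

def x_r_and_gap(snr, M, a, r):
    """x_r with z_x = 0, and gap = (r - r_1) / (snr log M)."""
    log_m, _, _, delta_a, _, r0 = constants(snr, M, a, r)
    r1 = r0 / 2.0 + math.sqrt(log_m) / (math.sqrt(math.pi) * (1.0 + delta_a))
    return 1.0 - r / (snr * log_m), (r - r1) / (snr * log_m)

snr, M, a, r = 15.0, 2**16, 1.0, 5.0  # assumed sample values
x_r, gap = x_r_and_gap(snr, M, a, r)
print(x_r, gap, g_low(x_r, snr, M, a, r) - x_r)
```

On a grid over $[0, x_r]$ the difference $g_{low}(x) - x$ stays above gap and decreases to it at $x_r$, as the lemma asserts.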



D. Choices of a and r that control the overall rate drop 

Here we focus on the evaluation of $a$ and $r$ that optimize our summary expressions for the rate drop, based on the lower bounds on $g_L(x) - x$. Recall that the rate of our inner code is

$$R = \mathcal{C}\, \frac{1 - h}{(1 + \delta_a)^2 (1 + r/\log M)}.$$

Now, for $r > r_1$, the function $g_L(x)$ is accumulative on $[0, x_r]$, with positive gap given by (58). Notice that $r_1$, given by (57), satisfies,

$$r_1 \le 1/4 + \frac{\sqrt{\log M}}{\sqrt{\pi}}. \tag{59}$$

Consequently, from Theorem 8, with high reliability, the total fraction of mistakes $\delta_{mis}$ is bounded by

$$\delta_{mis} = \frac{snr}{2C} \left[(1 - x_r) - (gap - \eta)/2\right].$$

If the outer Reed-Solomon code has distance designed to be at least $\delta_{mis}$ then any occurrences of a fraction of mistakes less than $\delta_{mis}$ are corrected. The overall rate of the code is $R_{total}$, which is at least $(1 - \delta_{mis})R$.

Sensible values of the parameters $a$ and $r$ can be obtained by optimizing the above overall rate under a presumption of small error probability, using simplifying approximations of our expressions. Reference values (corresponding to large $L$) are obtained by considering what the parameters become with $\eta = 0$, $f = f^*$, and $h = 0$.

Notice that $a$ is related to $f^*$ via the bound (38). Set $a$ so that

$$a\sqrt{2 \log M} = \log\left[1/(f^* \sqrt{2\pi} \sqrt{2 \log M})\right]. \tag{60}$$

We take $f^*$ as $gap^2/8$ as per Lemma 6. Consequently $a$ will depend on $r$ via the expression of gap given by (58).

Next, using the expressions for $1 - x_r$ and gap, along with $\eta = 0$, yields a simplified approximate expression for the mistake rate given by

$$\delta_{mis} = \frac{r + r_1}{4C \log M}.$$

Accordingly, the overall communication rate may be expressed as,

$$R_{total} = \left(1 - \frac{r + r_1}{4C \log M}\right) \frac{C}{(1 + \delta_a)^2 (1 + r/\log M)}.$$

As per these calculations (see [22] for details) we find it appropriate to take $r$ to be $r^*$, where

$$r^* = r_1 + 2/(1 + 1/C).$$

Also, the corresponding $a$ is seen to be

$$a = (3/2) \log(\log M)/\sqrt{2 \log M} + \tilde a,$$

where,

$$\tilde a = \frac{2 \log\left[snr\,(1 + 1/C)/\pi^{.25}\right]}{\sqrt{2 \log M}}.$$

Express $R_{total}$ in the form $C/(1 + drop)$. Then with the above choices of $r$ and $a$, and the bound on $r_1$ given in (59), one sees that drop can be approximated by

$$\frac{3 \log\log M + 4 \log(\omega_1\, snr) + 1/(4C) + 3.35}{2 \log M} + \frac{1 + 1/(2C)}{\sqrt{\pi \log M}},$$

where $\omega_1 = 1 + 1/C$.

We remark that the above explicit expressions are given to highlight the nature of dependence of the rate drop on snr and $M$. These are quite conservative. For more accurate numerical evaluation see section VI on computational illustrations.
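To get a feel for how slowly the approximate rate drop shrinks with the section size, the expression above can be tabulated. This is a rough sketch under the assumption that $C$ is measured in nats, $C = \frac{1}{2}\log(1 + snr)$, and that all logarithms are natural; the function name and the chosen values of $M$ are illustrative.

```python
import math

def approximate_drop(M, snr):
    """Approximate rate drop: R_total is roughly C / (1 + drop)."""
    C = 0.5 * math.log(1.0 + snr)  # capacity in nats (assumption)
    log_m = math.log(M)
    omega_1 = 1.0 + 1.0 / C
    first = (3.0 * math.log(log_m) + 4.0 * math.log(omega_1 * snr)
             + 1.0 / (4.0 * C) + 3.35) / (2.0 * log_m)
    second = (1.0 + 1.0 / (2.0 * C)) / math.sqrt(math.pi * log_m)
    return first + second

for exponent in (16, 32, 64, 128):
    drop = approximate_drop(2.0 ** exponent, snr=15.0)
    print(f"M = 2^{exponent}: drop = {drop:.3f}, fraction of C = {1.0 / (1.0 + drop):.3f}")
```

The second term decays only like $1/\sqrt{\log M}$, which is why these explicit expressions are conservative at practical section sizes.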

E. Definition of C* and proof of Proposition 1

In the previous subsection we gave the value of $r$ and $a$ that maximized, in an approximate sense, the outer code rate for given snr and $M$ values and for large $L$. This led to explicit expressions for the maximal achievable outer code rate as a function of snr and $M$. We define $C^*$ to be the inner code rate corresponding to this maximum achievable outer code rate. Thus,

$$C^* = \frac{C}{(1 + \delta_a)^2 [1 + r^*/\log M]}.$$

Similar to above, $C^*$ can be written as $C/(1 + drop^*)$ where $drop^*$ can be approximated by

$$\frac{3 \log\log M + 4 \log(\omega_1\, snr) + 4/\omega_1 - 2}{2 \log M} + \frac{1}{\sqrt{\pi \log M}},$$

with $\omega_1 = 1 + 1/C$. We now give a proof of our main result.
Proof of Proposition 1: Take $r = r^* + \kappa$. Using

$$(1 + \kappa/\log M)(1 + r^*/\log M) \ge (1 + r/\log M),$$

we find that for the rate $R$ as in Proposition 1, gap is at least $(r - r_1)/(snr \log M)$ for $x \le x_r$, with $1 - x_r = r/(snr \log M)$.

Take $f^* = (1/8)(r^* - r_1)^2/(snr \log M)^2$, so that $a$ is the same as given in the previous subsection. Now, we need to select $\rho > 1$ and $\eta > 0$, so that

$$f = \rho f^* \le (gap - \eta)^2/8 - 1/(2L_\pi).$$

Take $\omega = (1 + 1/C)/2$, so that $r^* = r_1 + 1/\omega$. One sees that we can satisfy the above requirement by taking $\eta$ as $(1/2)\kappa/(snr \log M)$ and $\rho = (1 + \kappa\omega/2)^2 - \epsilon_L$, where

$$\epsilon_L = \frac{(2\omega\, snr \log M)^2}{L_\pi}$$

is of order $(\log M)^2/L$, and hence is negligible compared to the first term in $\rho$. Since it has little effect on the error exponent, for ease of exposition, we ignore this term. We also assume that $f = (gap - \eta)^2/8$, ignoring the $1/(2L_\pi)$ term. We select

$$h = \frac{\kappa}{(2 \log M)^{3/2}}.$$

The fraction of mistakes,

$$\delta_{mis} = \frac{snr}{2C} \left[\frac{r}{snr \log M} - (gap - \eta)/2\right],$$

is calculated as in the previous subsection, except here we have to account for the positive $\eta$. Substituting the expression for gap and $\eta$ gives the expression for $\delta_{mis}$ as in the proposition.

Next, let's look at the error probability. The error probability is given by

$$m\, e^{-2L_\pi \eta^2 + m c_0} + m\, e^{-L_\pi f D(\rho)/\rho} + m\, e^{mh} e^{-nh^2/2}.$$

Notice that $nh^2/2$ is at least $(L_\pi \log M)h^2/(2C^*)$, where we use that $L \ge L_\pi$ and $R \le C^*$. Thus the above probability is less than

$$\kappa_1 \exp\{-L_\pi \min\{2\eta^2, f^* D(\rho), h^2 \log M/(2C^*)\}\}$$



with 



where for the above we use h < 1. 

Substituting, we see that $2\eta^2$ is $(1/2)\kappa^2/(snr \log M)^2$ and $h^2 \log M/(2C^*)$ is

$$\frac{1}{16C^*} \frac{\kappa^2}{(\log M)^2}.$$

Also, one sees that $D(\rho)$ is at least $2(\sqrt{\rho} - 1)^2/\sqrt{\rho}$. Thus the term $f^* D(\rho)$ is at least

$$\frac{\kappa^2}{(4\, snr \log M)^2 (1 + \kappa\omega/2)}.$$

We bound from below the above quantity by considering two cases, viz. $\kappa \le 2/\omega$ and $\kappa > 2/\omega$. For the first case we have $1 + \kappa\omega/2 \le 2$, so this quantity is bounded from below by $(1/2)\kappa^2/(4\, snr \log M)^2$. For the second case use that $\kappa/(1 + \kappa\omega/2)$ is bounded from below by $1/\omega$ to get that this term is at least $(1/\omega)\kappa/(4\, snr \log M)^2$.

Now we bound from below the quantity $\min\{2\eta^2, f^* D(\rho), h^2 \log M/(2C^*)\}$ appearing in the exponent. For $\kappa \le 2/\omega$ this quantity is bounded from below by

$$\kappa_3 \frac{\kappa^2}{(\log M)^2},$$

where

$$\kappa_3 = \min\{1/(32\, snr^2),\, 1/(16C^*)\}.$$

For $\kappa > 2/\omega$ this quantity is at least

$$\min\left\{\kappa_3 \frac{\kappa^2}{(\log M)^2},\; \kappa_4 \frac{\kappa}{\log M}\right\},$$

with

$$\kappa_4 = \frac{1}{8(1 + 1/C)\, snr^2 \log M}.$$

Also notice that $C^* - R$ is at most $C^* \kappa/\log M$. Thus we have that

$$\min\{2\eta^2, f^* D(\rho), h^2 \log M/(2C^*)\}$$

is at least

$$\min\{\kappa_3 (\Delta^*)^2,\, \kappa_4 \Delta^*\}.$$

Further, recalling that $L_\pi = L\nu/(2\mathcal{C})$, we get that $\kappa_2 = \nu/(2\mathcal{C})$, which is near $\nu/(2C)$.

Regarding the value of $m$, recall that $m$ is at most $2/(gap - \eta)$. Using the above we get that $m$ is at most $(2\omega\, snr) \log M$. Thus, ignoring the $3m$ term, $\kappa_1$ is polynomial in $M$ with power $2\omega\, snr \max\{c_0, 1/2\}$.

Part II is exactly similar to the use of Reed-Solomon codes in section VI of our companion paper [24]. ■



In the proof of Corollary 2, we let $\zeta_i$, for integer $i$, be constants that do not depend on $L$, $M$ or $n$.

Proof of Corollary 2: Recall $R_{tot} = (1 - \delta_{mis})R$. Using the form of $\delta_{mis}$ and $C^*$ for Proposition 1, one sees that $R_{tot}$ may be expressed as,



Rtnt — 1 



Ci<5m - C2, I C. 

logM, 



(61) 



Notice that M needs to be at least exp{(3/A 2 }, where A = 
(C — Rt ot )/C, for above to be satisfied. For a given section 
size M, the size of n would be larger for a larger C — Rtot- 
Choose k so that (^/logM is at least (i5m, so that by ( |6T) , 
one has, 



A>2( 2 



log M ' 



(62) 



Now following the proof of Proposition 1, since the error exponent is of the form $const \cdot \min\{\kappa/\log M, (\kappa/\log M)^2\}$, one sees that it is at least $const \cdot \min\{\Delta, \Delta^2\}$ from (62). ■

VIII. Discussion 

The paper demonstrated that the sparse superposition coding 
scheme, with the adaptive successive decoder and outer Reed- 
Solomon code, allows one to communicate at any rate below 
capacity, with block error probability that is exponentially 
small in L. It is shown in Q that this exponent can be 
improved by a factor of $\sqrt{\log M}$ from using a Bernstein bound
on the probability of the large deviation events analyzed here.

For fixed section size $M$, the power allocation (3) analyzed in the paper allows one to achieve any $R$ that is at least a drop of $1/\sqrt{\log M}$ from $C$. In contrast, constant power allocation allows us to achieve rates up to a threshold rate $R_0 = .5\, snr/(1 + snr)$, which is bounded by 1/2, but is near
C for small snr. In [22 1, [7| the alternative power allocation Q 
is shown to allow for rates that are of order $\log\log M/\log M$
from capacity. Our experience shows that it is advantageous
to use different power allocation schemes depending on the 
regime for snr. When snr is small, constant power allocation 
works better. The power allocation with leveling Q works 
better for moderately large snr, whereas (3) is appropriate for
larger snr values. 

One of the requirements of the algorithm, as seen in the 
proof of Corollary [2] is that for fixed rate Rtot, the section size 
M is needed to be exponential in 1/ A, using power allocation 
Q. Here A is the rate drop from capacity. Similar results 
hold for the other power allocations as well. However, this 
was not the case for the optimal ML-decoder, as seen in [24].
Consequently, it is still an open question whether there are 
practical decoders for the sparse superposition coding scheme 
which do not have this requirement on the dictionary size. 

Appendix A 
Proof of Lemma 4

For each $k \ge 2$, express $X$ as,

$$X = \frac{G_1}{\|G_1\|} Z_1^T + \ldots + \frac{G_{k-1}}{\|G_{k-1}\|} Z_{k-1}^T + \xi_k V_k,$$

where $\xi_k = [\xi_{k,k} : \ldots : \xi_{k,n}]$ is an $n \times (n - k + 1)$ orthonormal matrix, with columns $\xi_{k,i}$, for $i = k, \ldots, n$, being orthogonal to $G_1, \ldots, G_{k-1}$. There is flexibility in the choice of the $\xi_{k,i}$'s, the only requirement being that they depend on only $G_1, \ldots, G_{k-1}$ and no other random quantities. For convenience, we take these $\xi_{k,i}$'s to come from the Gram-Schmidt orthogonalization of $G_1, \ldots, G_{k-1}$ and the columns of the identity matrix.

The matrix Vk, which is (n — fc + 1) x N dimensional, is 
also denoted as, 



= [14,! : y M 



v k . 



N 



The columns V^j, where j = 1,...,N gives the coef- 
ficients of the expansion of the column Xj in the basis 
Zk,k,€k,k+i, ■ ■ ■ ,6c,n- We also denote the entries of V k as 
Vk,i,j, where i = k, . . . , n and j = 1, . . . , N. 

We prove that conditional on Tk-i, the distribution of 
(Vk,i,j '■ j G Jk-i), for i = k,...,n, is i.i.d. Normal 
N(0, Efc_i). The proof is by induction. 

The stated property is true initially, at k = 2, from Lemma 
[3] Recall that the rows of the matrix U in Lemma [3] are i.i.d. 
N(0, Si). Correspondingly, since V2 — £,2^, and since the 
columns of £2 are orthonormal, and independent of U, one 
gets that the rows of V2 are i.i.d N(0, Si) as well. 

Presuming the stated conditional distribution property to be 
true at k, we conduct analysis, from which its validity will 
be demonstrated at k + 1. Along the way the conditional 
distribution properties of G k , Zk,j, an d ^k,j are obtained 
as consequences. As for Wk and S k we first obtain them by 
explicit recursions and then verify the stated form. 

Denote as 



G 



coef 



jGdeck-i 



.,n 



(63) 



Also denote as, 



/^icoef /^coef ^coef ^coe/\T 

u k — \ u k,k ) u k,k+l> ■ ■ ■ ' W.n I ■ 

The vector G c k oe ^ gives the representation of G k in the basis 
consisting of columns vectors of £ k . In other words, G k = 



Notice that, 



\GT S \ 



(64) 



Further, since V k j and G c k ° e ^ are jointly normal conditional 
on Tk-\, one gets, through conditioning on G^ oe ^ that, 

V k ,j = b k -i,j G c k oef /a k + U kJ . 

Denote as U k = [Uk,i ■ U k .2 ■ ■ ■ ■ ■ U k ,N], which is an 
(rt — k + l)x N dimensional matrix like V k . The entries of U k 
are denoted as U k .ij, where i = k, . . . ,n and j = 1, . . . , N. 
The matrix U k is independent of G c k ° e ^ , conditioned on J^k-i- 



(65) 



Further, from the representation ( |64| i, one gets that 

Z k ,j = bk-i,j ||G™ e \\/a k + Z ki j, 

with, 



\G k ef \ 



For the conditional distribution of $G^{coef}_{k,i}$ given $\mathcal{F}_{k-1}$, independence across $i$, conditional normality and conditional mean 0 are properties inherited from the corresponding properties of the $V_{k,i,j}$. To obtain the conditional variance of $G^{coef}_{k,i}$, given by (63), use the conditional covariance

$$\Sigma_{k-1} = I - \delta_{k-1}\delta_{k-1}^T$$

of $(V_{k,i,j} : j \in J_{k-1})$. The identity part contributes
J2 3 edec k ^ 1 p j which is + fk-i)P; whereas, the 

5k-i$k-i P art > using the presumed form of Sk-i, contributes 
an amount seen to equal v k -i[T, je sentndec^ 1 p j / p ? p 
which is i'k-iql_ 1 P- It follows that the conditional expected 
square for G c k oe J , for i = k, . . . , n is 

CTfc = [qk-l+fk-l-Qk-l U k-l]P- 

Conditional on J-'k-i, the distribution of 



|G 



i — k 



is that of <j\ X 2 _ k+1 , a multiple of a Chi-square with n—k + 1 
degrees of freedom. 

Next we compute bk-i,j in ( |65j ), which is the value of 



Wk,i,jGTf\^k-i]/^k 

for any of the coordinates i = k, . . . ,n. Consider the product 
Vk,i,j G'ki* m me numera t° r - Using the representation of 

G Tf in & one has Wk,i,jG c k f\T k -i] is 



i'Gdec fc _i 



which simplifies to - y/P] [l J - edect! _ 1 - sent ] . So 

for j in Jfc = Jfe-i — deck-i, we have the simplification 



gfc-l Vk-lfij 



(66) 



Also, for j, j' in Jfc, the product takes the form 

°fc-lj°fc-lj' — Cfc-l,jOfe-l,j' 5 . • 

1 + fk-i/Qk-i - <lk-\Vk-\ 

Here the ratio simplifies to (7^^^-1/(1 — qili^k-i)- 

Now determine the features of the joint normal distribution 
of the 

Uk,i,j — Vk,ij — bk-i,j /<7k, 

for j £ J fe , given Tk-x- Given T k -i, the (U k ,i,j : j € 
Jfc) are i.i.d across choices of i, but there is covariance 
across choices of j for fixed i. This conditional covariance 
^[Uk.i,jUk.i,j'\J~k-i], by the choice of bk-i,j, reduces to 
^■[Vk,i,jVk,i,j'\IFk-i] - h-i,jh-i,j< which, for j e J fc , is 

lj=j' — Sk-i,jSk-i,j' — bk-i,jbk-i,j<- 

That is, for each j, the (Uk,i,j '■ j € Jfc) have the joint 
7Vj fc (0, Efc) distribution, conditional on Fk-i, where Efc again 



takes the form 1, 



S k ,jSkj 



where 



fik,j$k,j' — Sk-i,jSk—i,j> < 1 + 



-Qk-^k-i 



for j, j' now restricted to Jfc. The quantity in braces simplifies 
to 1/(1 — q'?z 1 Vk-i). Correspondingly, the recursive update 
rule for v k is 

Vk = 



Vk-\ 



1 



Consequently, the joint distribution for (Zk,j ■ j € Jfc) 
is determined, conditional on J-'k-i- It is also the normal 
iV(0,Efc) distribution and (.Zfcj : j € Jfc) is conditionally 
independent of the coefficients of G k oe ^ , given Tk-i- After 
all, the 



-tcoe/ 1 



have this Nj k (0, Efc) distribution, conditional on G c k ° e ^ and 
Jfc_i, but since this distribution does not depend on G k oe 
we have the stated conditional independence. 

This makes the conditional distribution of the Zk.j, given 
Jfc_i, as given in ( |o*5) l, a location mixture of normals with 
distribution of the shift of location determined by the Chi- 
square distribution of X 2 _ k+1 = \\G c k oef \\ 2 / a 2 k . Using the 
form of 6fc_i.j, for j in J k , the location shift b k -i,j X n _ 
may be written 



fc+i 



y/w k Cj, R [X n -k + i/Vn] 1 



j sent ) 



where 



Wk 



k,j 



c 



The numerator and denominator has dependence on j through 
Pj, so canceling the Pj produces a value for u>k- Indeed, 
Gj, R = (Pj/P)u(L/R) logM equals n(Pj/P)v and b\_ ld = 
PjQk-i v k-xl\^ — Qk-i^k-i]- So this Wk may be expressed as 



Wk = 



^adj 

Vk-1 %-l U k-\ 
a 1 -ad?" 
V 1 ~ %-\ v k-\ 



which, using the update rule for v k _i, is seen to equal 



w k = 



V 



Further, repeatedly apply v^/vy-i = V( 1_ C'-i v k'-i), for 
k' from k to 2, each time substituting the required expression 
on the right and simplifying to obtain 



Vk 
Vk-l 



1- ($* + ... + %* 2 ) v 



-adj 



+ Qk-2 + <lk-l) V 



1 - 

This yields Vk — vsk, which, when plugged into the expres- 
sions for lik, establishes the form of Wk as given in the lemma. 

We need to prove that, conditional on $\mathcal{F}_k$, the rows of $V_{k+1}$, for $j \in J_k$, are i.i.d. $N_{J_k}(0, \Sigma_k)$. Recall that $V_{k+1} = \xi_{k+1}^T X$. Since the column span of $\xi_{k+1}$ is contained in that of $\xi_k$, one may also write $V_{k+1}$ as $\xi_{k+1}^T \xi_k V_k$. Similar to the representation $G_k = \xi_k G_k^{coef}$, express the columns of $\xi_{k+1}$ in terms of the columns of $\xi_k$ as $\xi_{k+1} = \xi_k \xi_k^{coef}$, where $\xi_k^{coef}$ is an $(n - k + 1) \times (n - k)$ dimensional matrix. Using this representation one gets that $V_{k+1} = (\xi_k^{coef})^T V_k$.

Notice that $\xi_k$ is $\mathcal{F}_{k-1}$ measurable and that $\xi_{k+1}$ is $\sigma\{\mathcal{F}_{k-1}, G_k\}$ measurable. Correspondingly, $\xi_k^{coef}$ is also $\sigma\{\mathcal{F}_{k-1}, G_k\}$ measurable. Further, because of the orthonormality of $\xi_k$ and $\xi_{k+1}$, one gets that the columns of $\xi_k^{coef}$ are also orthonormal. Further, as $G_k$ is orthogonal to $\xi_{k+1}$, one has that $G_k^{coef}$ is orthogonal to the columns of $\xi_k^{coef}$ as well.

Accordingly, one has that $V_{k+1} = (\xi_k^{coef})^T U_k$. Consequently, using the independence of $U_k$ and $G_k^{coef}$, and the above, one gets that conditional on $\sigma\{\mathcal{F}_{k-1}, G_k\}$, for $j \in J_k$, the rows of $V_{k+1}$ are i.i.d. $N_{J_k}(0, \Sigma_k)$.

We need to prove that conditional on $\mathcal{F}_k$ the distribution of $V_{k+1}$ is as above, where recall that $\mathcal{F}_k = \sigma\{\mathcal{F}_{k-1}, G_k, Z_k\}$ or equivalently, $\sigma\{\mathcal{F}_{k-1}, G_k, \mathcal{Z}_k\}$. This claim follows from the conclusion of the previous paragraph by noting that $V_{k+1}$ is independent of $\mathcal{Z}_k = U_k^T G_k^{coef}/\|G_k^{coef}\|$, conditional on $\sigma\{\mathcal{F}_{k-1}, G_k\}$, as $G_k^{coef}$ is orthogonal to $\xi_k^{coef}$.

This completes the proof of Lemma 4.

Appendix B 
The Method of Nearby Measures 

Let $b \in \mathbb{R}^n$ be such that $\|b\|^2 = \nu < 1$. Further, let $\mathbb{P}$ be the probability measure of a $N(0, \Sigma)$ random variable, where $\Sigma = I - bb^T$, and let $\mathbb{Q}$ be the measure of a $N_n(0, I)$ random variable. Then we have,

Lemma 12.

$$\mathbb{P}[A] \le \mathbb{Q}[A]\, e^{c_0},$$

where $c_0 = -(1/2) \log(1 - \nu)$.

Proof: If $p(z)$, $q(z)$ denote the densities of the random variables with measures $\mathbb{P}$ and $\mathbb{Q}$ respectively, then $\max_z p(z)/q(z)$ equals $1/(1 - \nu)^{1/2}$, which is also $e^{c_0}$. From the densities $N(0, I - bb^T)$ and $N(0, I)$ this claim can be established from noting that after an orthogonal transformation these measures are only different in one variable, which is either $N(0, 1 - \nu)$ or $N(0, 1)$, for which the maximum ratio of the densities occurs at the origin and is simply the ratio of the normalizing constants.

Correspondingly,

$$\mathbb{P}(A) = \int_A p(z)\, dz \le e^{c_0} \int_A q(z)\, dz = \mathbb{Q}(A)\, e^{c_0}.$$

This completes the proof of the lemma. ■
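The one-dimensional density ratio at the heart of this argument can be inspected numerically. This is a minimal sketch; the function names and the sample value of $\nu$ are assumptions made for illustration.

```python
import math

def density_ratio(z, nu):
    """Ratio of the N(0, 1 - nu) density to the N(0, 1) density at z."""
    var = 1.0 - nu
    p = math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    q = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return p / q

def c_zero(nu):
    """c_0 = -(1/2) log(1 - nu)."""
    return -0.5 * math.log(1.0 - nu)

nu = 0.9375  # corresponds to snr = 15 via nu = snr / (1 + snr)
grid = [i / 100.0 for i in range(-500, 501)]
largest = max(density_ratio(z, nu) for z in grid)
print(largest, math.exp(c_zero(nu)))
```

The ratio peaks at the origin, where it equals $(1 - \nu)^{-1/2} = e^{c_0}$, so each switch of measure costs at most this factor.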

Proof of Lemma HJ We are to show that for events A 
determined by the random variables pi) , the probability 
F[A] is not more than Q[A]e fcc °. Write the probability as an 
iterated expectation conditioning on J- k -i- That is, ¥[A] = 
E pP[A|J-jt_i]]. To determine membership in A, conditional 
on J-)c-i, we only need Zk,j k = (Zkj : j £ 4) where Jk is 
determined by T k -\. Thus 



¥[A] = E P 



where we use the subscript on the outer expectation to 
denote that it is with respect to P and the subscripts on 
the inner conditional probability to indicate the relevant vari- 
ables. For this inner probability switch to the nearby measure 
Qx dk ,z k j k \J r k -i- These conditional measures agree concern- 
ing the distribution of the independent <f J , so what matters 



is the ratio of the densities corresponding to Vz k r \j^ h _ 1 and 

We claim that the ratio of these densities is bounded by $e^{c_0}$. To see this, recall from Lemma 4 that $\mathbb{P}_{Z_{k,J_k}|\mathcal{F}_{k-1}}$ is $N_{J_k}(0, \Sigma_k)$, with $\Sigma_k = I - \delta_k\delta_k^T$. Now

$$\|\delta_k\|^2 = \nu_k \sum_{j \in sent \cap J_k} P_j/P,$$

which is $(1 - (q_1 + \ldots + q_{k-1}))\,\nu_k$. Noting that $\nu_k = s_k \nu$ and $s_k(1 - (q_1 + \ldots + q_{k-1}))$ is at most 1, we get that $\|\delta_k\|^2 \le \nu$.

So with the switch of conditional distribution, we obtain a bound with a multiplicative factor of $e^{c_0}$. The bound on the inner expectation is then a function of $\mathcal{F}_{k-1}$, so the conclusion follows by induction. This completes the proof of Lemma 5.



Appendix C 
Proof of Lemma 6

For $k = 1$, the $q_{1,1} = g_L(0) - \eta$ is at least $gap - \eta$. Consider $q_{1,k} = g_L(q^{adj,tot}_{k-1}) - \eta$, for $k > 1$. Notice that

$$q^{adj,tot}_{k-1} \ge \sum_{k'=1}^{k-1} q_{k'} - (k - 1)f,$$

using $q/(1 + f/q) \ge q - f$. Now, from the definition of $q_k$ in (42), one has

$$\sum_{k'=1}^{k-1} q_{k'} = q_{1,k-1} - (k - 1)(f + 1/L_\pi).$$

Consequently,

$$q^{adj,tot}_{k-1} \ge q_{1,k-1} - (k - 1)(2f + 1/L_\pi). \tag{67}$$

Denote $m$ as the first $k$ for which $q^{adj,tot}_{k-1}$ exceeds $x_r$. For any $k < m$, as $q^{adj,tot}_{k-1} \le x_r$, using the fact that $g_L$ is accumulative till $x_r$, one gets that

$$q_{1,k} \ge q^{adj,tot}_{k-1} + gap - \eta.$$

Accordingly, using (67), one gets that

$$q_{1,k} \ge q_{1,k-1} - (k - 1)(2f + 1/L_\pi) + gap - \eta, \tag{68}$$

or in other words, for $k < m$, one has

$$q_{1,k} - q_{1,k-1} \ge -m(2f + 1/L_\pi) + gap - \eta.$$

We want to arrange the difference $q_{1,k} - q_{1,k-1}$ to be at least a positive quantity which we denote by $\Lambda$. Notice that this gives $m \le 1/\Lambda$, since the $q_{1,k}$'s are bounded by 1. Correspondingly, we solve for $\Lambda$ in,

$$\Lambda = -(1/\Lambda)(2f + 1/L_\pi) + gap - \eta,$$

and see that the solution is

$$\Lambda = \frac{(gap - \eta)}{2} \left[1 + \left(1 - \frac{4(2f + 1/L_\pi)}{(gap - \eta)^2}\right)^{1/2}\right], \tag{69}$$

which is well defined since $f$ satisfies (43). Also notice from (69) that $\Lambda \ge (gap - \eta)/2$, making $m \le 2/(gap - \eta)$. Also $q_{1,m} = g_L(q^{adj,tot}_{m-1}) - \eta$, which is at least $g_L(x_r) - \eta$ since $g_L$ is increasing. The latter quantity is at least $x_r + gap - \eta$.
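The step increment can be computed as the larger root of the quadratic $\Lambda^2 - (gap - \eta)\Lambda + (2f + 1/L_\pi) = 0$. The sketch below is illustrative only; the function name and the sample values of gap, $\eta$, $f$ and $L_\pi$ are assumptions, chosen so that the requirement $f \le (gap - \eta)^2/8 - 1/(2L_\pi)$ holds.

```python
import math

def step_increment(gap, eta, f, L_pi):
    """Larger root Lambda of Lambda = -(2f + 1/L_pi)/Lambda + gap - eta."""
    g = gap - eta
    c = 2.0 * f + 1.0 / L_pi
    discriminant = 1.0 - 4.0 * c / (g * g)
    if discriminant < 0.0:
        raise ValueError("f must satisfy f <= (gap - eta)^2 / 8 - 1 / (2 L_pi)")
    return 0.5 * g * (1.0 + math.sqrt(discriminant))

gap, eta, f, L_pi = 0.05, 0.01, 1e-4, 1e4  # assumed sample values
lam = step_increment(gap, eta, f, L_pi)
max_steps = 1.0 / lam
print(lam, max_steps, 2.0 / (gap - eta))
```

Because $\Lambda$ is at least $(gap - \eta)/2$, the number of steps $1/\Lambda$ never exceeds $2/(gap - \eta)$.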






Appendix D 
Proof of Lemma 7
We prove the lemma by first showing that 

Ai, m ci lim U5i, m Ufli, m . (70) 

Next, we prove that B lm is contained in B\. m . This will prove 
the lemma. 

We start by showing ( f70] i. We first show that on the set 

{?u>?i,*}n^_in^ 

condition ((18), that is, 



E 



E 

je J-deci fe _! 



7T J -l {Z co„ 1 b> r} > <7l, fc , 



(71) 
(72) 



is satisfied. Following the arguments of subsection II-B 



garding pacing the steps, this will ensure that the size of the 
decoded set after k steps, that is size\^ k , is near qx ik , or more 
precisely 

gi,fc - 1/L n < sizei t k < qi,k, (73) 

as given in ( [17) , 

Notice that the left side of d72li is at least 



j£sent 

since the sum in ( |72| ) is over all terms in j, including those 
in sent, and further, for each term j, the contribution to the 
sum is at least irjl{ Z comb >T y, 
Further, using the fact that 

flkj C {Z%f> > r} on EU fl 

from d35j, one gets that, 



j£sent 



7r j 1 {z-°p b >T} > 9i, fe 



on n 



Correspondingly, on the set ( f7Tj ) the inequality ( f72] >, and 
consequently the relation f73] l also holds. 
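Next, for each $k$, denote

$$\tilde E_k = \tilde A_{1,k} \cup \tilde B_{1,k} \cup D_{1,k}. \qquad (74)$$

We claim that for each $k = 1, \ldots, m$, one has

$$\tilde E_k^c \subseteq A_{1,k}^c.$$

We prove the claim through induction on $k$. Notice that, after taking
complements, the claim for $k = m$ is precisely statement (70). Also, the claim
implies that $\tilde E_k^c \subseteq E_k^c$, for each $k$, where recall that
$E_k = A_{1,k} \cup B_{1,k} \cup D_{1,k}$.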

We first prove the claim for $k = 1$. We see that

$$\tilde E_1^c = \{\hat q_{1,1} \ge q_{1,1}\} \cap \{\hat f_1 \le f\} \cap D_1^c.$$

Using the arguments above, we see that on $\{\hat q_{1,1} \ge q_{1,1}\} \cap D_1^c$,
the relation $q_{1,1} - 1/L_\pi < size_{1,1}$ holds. Now, since $size_{1,1} =
\hat q_1 + \hat f_1$, one gets that

$$\hat q_1 \ge q_{1,1} - \hat f_1 - 1/L_\pi \quad \text{on } \{\hat q_{1,1} \ge q_{1,1}\} \cap D_1^c.$$

The right side of the aforementioned inequality is at least $\tilde q_1$
on $\tilde E_1^c$, using $\hat f_1 \le f$. Consequently, the claim is proved for
$k = 1$.



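Assume that the claim holds till $k - 1$, that is, assume that
$\tilde E_{k-1}^c \subseteq A_{1,k-1}^c$. We now prove that $\tilde E_k^c \subseteq A_{1,k}^c$ as well.
Notice that

$$\tilde E_k^c = \{\hat q_{1,k} \ge q_{1,k}\} \cap \tilde E_{k-1}^c \cap D_k^c \cap \{\hat f_k \le f\}.$$

Using $\tilde E_{k-1}^c \subseteq E_{k-1}^c$ from the induction hypothesis,
one gets that

$$q_{1,k} - 1/L_\pi < size_{1,k} \quad \text{on } \{\hat q_{1,k} \ge q_{1,k}\} \cap \tilde E_{k-1}^c \cap D_k^c.$$

Accordingly,

$$q_{1,k} - 1/L_\pi < size_{1,k} \le q_{1,k} \quad \text{on } \tilde E_k^c.$$

Further, as $\tilde E_k^c$ is contained in $\tilde E_{k-1}^c$, one gets that

$$q_{1,k-1} - 1/L_\pi < size_{1,k-1} \le q_{1,k-1} \quad \text{on } \tilde E_k^c.$$

Consequently, combining the above, one has

$$size_{1,k} - size_{1,k-1} = \hat q_k + \hat f_k \ge q_{1,k} - q_{1,k-1} - f - 1/L_\pi \quad \text{on } \tilde E_k^c.$$

Consequently, on $\tilde E_k^c$, we have $\hat q_k \ge \tilde q_k$, using the expression
for $\tilde q_k$ given in (42), and the fact that $\hat f_k \le f$ on $\tilde E_k^c$.
Combining this with the fact that $\tilde E_k^c$ is contained in $A_{1,k-1}^c$,
since $\tilde E_k^c \subseteq \tilde E_{k-1}^c$ and $\tilde E_{k-1}^c \subseteq A_{1,k-1}^c$ from the induction
hypothesis, one gets that $\tilde E_k^c \subseteq A_{1,k}^c$.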

This completes the induction. In particular, the claim holds
for $k = m$, which, taking complements, proves the statement
(70).

Next, we show that $B_{1,m} \subseteq \tilde B_{1,m}$. This is straightforward
since, recall from subsection IV-E, one has $f_k \le f$
for each $k$. Correspondingly, $B_{1,m}$ is contained in $\tilde B_{1,m}$.

Consequently, from (70) and the fact that $B_{1,m} \subseteq \tilde B_{1,m}$,
one gets that $E_m$ is contained in $\tilde A_{1,m} \cup \tilde B_{1,m} \cup D_{1,m}$. This
proves the lemma.

Appendix E 
Tails for weighted Bernoulli sums 

Lemma 13. Let $W_j$, $1 \le j \le N$, be $N$ independent
Bernoulli($r_j$) random variables. Furthermore, let $\alpha_j$, $1 \le j \le N$,
be non-negative weights that sum to 1 and let $N_\alpha =
1/\max_j \alpha_j$. Then the weighted sum $\hat r = \sum_j \alpha_j W_j$, which
has mean given by $r^* = \sum_j \alpha_j r_j$, satisfies the following large
deviation inequalities. For any $r$ with $0 < r < r^*$,

$$P(\hat r < r) \le \exp\{-N_\alpha D(r \| r^*)\}$$

and for any $\tilde r$ with $r^* < \tilde r < 1$,

$$P(\hat r > \tilde r) \le \exp\{-N_\alpha D(\tilde r \| r^*)\}$$

where $D(r \| r^*)$ denotes the relative entropy between Bernoulli
random variables of success parameters $r$ and $r^*$.
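
The lower-tail inequality of Lemma 13 can be checked exactly for a small number of variables by enumerating all outcomes. The Python sketch below does this; the weights and success parameters are hypothetical values chosen for illustration (they are not from the paper), and the helper names are likewise illustrative. Natural logarithms are used throughout.

```python
import itertools
import math


def bernoulli_kl(r, r_star):
    """Relative entropy D(r || r_star) between Bernoulli laws (natural log)."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(r, r_star) + term(1.0 - r, 1.0 - r_star)


def lower_tail(alpha, probs, r):
    """Exact P(sum_j alpha_j W_j < r), enumerating all 2**N outcomes."""
    total = 0.0
    for w in itertools.product((0, 1), repeat=len(alpha)):
        if sum(a * x for a, x in zip(alpha, w)) < r:
            total += math.prod(p if x else 1.0 - p for p, x in zip(probs, w))
    return total


# Hypothetical inputs: non-negative weights summing to 1, and parameters r_j.
alpha = [0.25, 0.2, 0.15, 0.15, 0.1, 0.1, 0.05]
probs = [0.6, 0.7, 0.5, 0.8, 0.65, 0.55, 0.75]

n_alpha = 1.0 / max(alpha)                          # N_alpha = 1 / max_j alpha_j
r_star = sum(a * p for a, p in zip(alpha, probs))   # mean of the weighted sum

# Compare the exact tail with the bound for several thresholds below r_star.
checks = []
for r in (0.2, 0.35, 0.5, 0.6):
    exact = lower_tail(alpha, probs, r)
    bound = math.exp(-n_alpha * bernoulli_kl(r, r_star))
    checks.append((r, exact, bound))
```

Each entry of `checks` holds a threshold, the exact tail probability, and the bound from the lemma; the exact value should never exceed the bound. Note that the exponent uses $N_\alpha$ rather than $N$, so a single dominant weight weakens the bound accordingly.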



Proof of Lemma 13: We prove the first part. The proof of
the second part is similar.
Denote the event

$$A = \{W : \sum_j \alpha_j W_j < r\}$$

with $W$ denoting the $N$-vector of $W_j$'s. Proceeding as in
Csiszár [17] we have that

$$P(A) = \exp\{-D(P_{W|A} \| P_W)\} \le \exp\{-\sum_j D(P_{W_j|A} \| P_{W_j})\}.$$

Here $P_{W|A}$ denotes the conditional distribution of the vector
$W$ conditional on the event $A$ and $P_{W_j|A}$ denotes the associated
marginal distribution of $W_j$ conditioned on $A$. Now

$$\sum_j D(P_{W_j|A} \| P_{W_j}) \ge N_\alpha \sum_j \alpha_j D(P_{W_j|A} \| P_{W_j}).$$

Furthermore, the convexity of the relative entropy implies that

$$\sum_j \alpha_j D(P_{W_j|A} \| P_{W_j}) \ge D\Big(\sum_j \alpha_j P_{W_j|A} \,\Big\|\, \sum_j \alpha_j P_{W_j}\Big).$$

The sums on the right denote mixtures of the distributions
$P_{W_j|A}$ and $P_{W_j}$, respectively, which are distributions on
{0, 1}, and hence these mixtures are also distributions on
{0, 1}. In particular, $\sum_j \alpha_j P_{W_j}$ is the Bernoulli($r^*$) distribution
and $\sum_j \alpha_j P_{W_j|A}$ is the Bernoulli($r_e$) distribution where

$$r_e = E\Big[\sum_j \alpha_j W_j \,\Big|\, A\Big] = E[\hat r \,|\, A].$$

But in the event $A$ we have $\hat r < r$ so it follows that $r_e \le r$. As
$r < r^*$ this yields $D(r_e \| r^*) \ge D(r \| r^*)$. This completes
the proof of Lemma 13.



Appendix F 
Lower Bounds on D 

Lemma 14. For $p \ge p^*$, the relative entropy between
Bernoulli($p$) and Bernoulli($p^*$) distributions has the succession
of lower bounds

$$D_{Ber}(p \| p^*) \ge D_{Poi}(p \| p^*) \ge 2(\sqrt{p} - \sqrt{p^*})^2 \ge \frac{(p - p^*)^2}{2p}$$

where $D_{Poi}(p \| p^*) = p \log(p/p^*) + p^* - p$ is also recognizable
as the relative entropy between Poisson distributions of mean
$p$ and $p^*$ respectively.

Proof: The Bernoulli relative entropy may be expressed
as the sum of two non-negative terms, one of which is $p \log(p/p^*) +
p^* - p$, and the other is the corresponding term with $1 - p$ and
$1 - p^*$ in place of $p$ and $p^*$, so this demonstrates the first
inequality. Now suppose $p > p^*$. Write $p \log(p/p^*) + p^* - p$
as $p^* F(s)$ where $F(s) = 2s^2 \log s + 1 - s^2$ with $s^2 = p/p^*$,
which is at least 1. This function $F$ and its first derivative
$F'(s) = 4s \log s$ have value equal to 0 at $s = 1$, and its second
derivative $F''(s) = 4 + 4 \log s$ is at least 4 for $s \ge 1$. So by
second order Taylor expansion $F(s) \ge 2(s - 1)^2$ for $s \ge 1$.
Thus $p \log(p/p^*) + p^* - p$ is at least $2(\sqrt{p} - \sqrt{p^*})^2$. Furthermore
$2(s - 1)^2 \ge (s^2 - 1)^2/(2s^2)$ as, taking the square root of both
sides, it is seen to be equivalent to $2s(s - 1) \ge s^2 - 1$, which,
factoring out $s - 1$ from both sides, reduces to $2s \ge s + 1$ and so is seen to hold for $s \ge 1$.
From this we have the final lower bound $(p - p^*)^2/(2p)$. ■
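
The succession of lower bounds in Lemma 14 is easy to check numerically. The Python sketch below evaluates the four quantities on a grid of pairs with $p \ge p^*$ and records any pair at which the chain fails to be non-increasing. The grid, the tolerance, and the function names are illustrative choices, not part of the paper; natural logarithms are used.

```python
import math


def d_bernoulli(p, p_star):
    """Relative entropy between Bernoulli(p) and Bernoulli(p_star)."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, p_star) + term(1.0 - p, 1.0 - p_star)


def d_poisson(p, p_star):
    """Relative entropy between Poisson laws of means p and p_star."""
    return p * math.log(p / p_star) + p_star - p


def chain(p, p_star):
    """The four quantities of Lemma 14, from largest to smallest."""
    return [
        d_bernoulli(p, p_star),
        d_poisson(p, p_star),
        2.0 * (math.sqrt(p) - math.sqrt(p_star)) ** 2,
        (p - p_star) ** 2 / (2.0 * p),
    ]


def is_nonincreasing(values, tol=1e-12):
    """True when each entry is at least the next, up to rounding."""
    return all(a >= b - tol for a, b in zip(values, values[1:]))


# Grid 0.05, 0.10, ..., 0.95 for both parameters; keep pairs with p >= p_star.
grid = [i / 20 for i in range(1, 20)]
violations = [
    (p, p_star)
    for p_star in grid
    for p in grid
    if p >= p_star and not is_nonincreasing(chain(p, p_star))
]
```

An empty `violations` list is consistent with the lemma on this grid. Such a check illustrates the inequalities but is of course no substitute for the proof above.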